Overall introduction: This role is responsible for the resource matching and orchestration of ByteDance's machine learning systems across the recommendation, advertising, and search businesses, as well as the design and development of the business-scenario scheduling system serving model training, model evaluation, and model inference. The system involves the following areas of work, and you can participate in at least one of them:

1. Distributed scheduling layer, which solves the distributed deployment of individual services:
   (a) Use and extend distributed scheduling frameworks such as Kubernetes, Yarn, Mesos, and Celery; select the appropriate framework for different business scenarios and optimize cluster utilization and scheduling strategies based on the characteristics of each framework.
   (b) Integrate with and extend each framework's horizontal/vertical scaling and AutoScaling capabilities; participate in adaptation work for multi-cluster hybrid scheduling (similar to federated Kubernetes); own the preemption/eviction of services at different priorities; own resource lending and hybrid deployment between different resource types across clusters; own scheduling and load adaptation across multiple data centers, regions, and clouds.

2. Resource matching layer, which solves the joint allocation of resources among multiple roles: optimize allocation rates and resource efficiency from a global perspective; coordinate and jointly match CPU/GPU/other heterogeneous hardware, model data, sample data, and externally called resources; sense topology constraints and perform micro-topology optimization to improve overall network bandwidth usage; link budgeting and delivery of massive resources across multiple tenants, guaranteed and over-budget resources, and colocation/overselling scenarios.
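As a toy illustration (not part of the posting) of the preemption/eviction responsibility described in (1b), the sketch below shows one simple policy: evict the lowest-priority running services until a higher-priority pending service fits. The `Pod` structure and `pick_victims` function are hypothetical names invented for this example, not ByteDance or Kubernetes internals.

```python
# Hypothetical sketch of priority-based preemption: choose the cheapest
# (lowest-priority) victims to evict so a pending high-priority pod fits.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Pod:
    name: str
    priority: int   # higher value = more important
    cpu: float      # CPU cores requested

def pick_victims(pending: Pod, running: list, free_cpu: float) -> Optional[list]:
    """Return a list of pods to evict so `pending` fits, or None if
    preemption cannot free enough CPU."""
    # Only strictly lower-priority pods are eligible victims.
    candidates = sorted(
        (p for p in running if p.priority < pending.priority),
        key=lambda p: p.priority,
    )
    victims, reclaimed = [], free_cpu
    for p in candidates:
        if reclaimed >= pending.cpu:
            break
        victims.append(p)
        reclaimed += p.cpu
    return victims if reclaimed >= pending.cpu else None
```

Real schedulers such as Kubernetes additionally respect PodDisruptionBudgets and graceful-termination periods; this sketch only captures the priority-ordering idea.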
3. Participate in the process and functional requirements of training scenarios, such as staged orchestration and batch/stream stage encapsulation; improve the stability of single-replica training services, e.g., failover protection and richer checkpoint strategies; improve observability, operability, and user experience.

4. Participate in offline-to-online collaboration, including optimization of data consistency and update timeliness; traffic scheduling for heterogeneous resources; stability plans for multi-replica online services; dynamic online orchestration of models and services; and inter-cluster placement orchestration.
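To make the failover-protection and checkpoint-strategy responsibility in item 3 concrete, here is a minimal, hypothetical sketch: a training loop that periodically writes an atomic checkpoint and resumes from the last one after a restart. The checkpoint format and `run`/`save_checkpoint` functions are illustrative placeholders, not the actual system's API.

```python
# Hypothetical sketch of checkpoint-based failover for a training replica.
import json
import os
import tempfile

def save_checkpoint(path: str, step: int, state: dict) -> None:
    """Write atomically so a crash mid-write never corrupts the backup point."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp, path)  # atomic rename on POSIX

def load_checkpoint(path: str):
    """Resume from the last backup point, or start from scratch if none exists."""
    if not os.path.exists(path):
        return 0, {}
    with open(path) as f:
        ckpt = json.load(f)
    return ckpt["step"], ckpt["state"]

def run(path: str, total_steps: int, ckpt_every: int = 10) -> int:
    """Train to total_steps, checkpointing periodically; safe to re-run after a crash."""
    step, state = load_checkpoint(path)   # failover: pick up where we left off
    while step < total_steps:
        state["loss"] = 1.0 / (step + 1)  # stand-in for one real training step
        step += 1
        if step % ckpt_every == 0:
            save_checkpoint(path, step, state)
    return step
```

Production systems layer more on top (multiple retained backup points, remote storage, leader election for the replica), but the resume-from-last-checkpoint loop is the core of single-replica failover protection.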
Minimum qualifications:
1. Proficient in at least one of Go/Python in a Linux environment, with excellent hands-on coding ability.
2. Familiar with mainstream open-source distributed scheduling frameworks such as Kubernetes (K8s), Yarn (Flink, MapReduce), Mesos, or Celery, with rich practical experience developing machine learning systems.
3. Master the principles of distributed systems, and have participated in the design, development, and maintenance of large-scale distributed systems.
4. Excellent logical analysis skills, with the ability to reasonably abstract and decompose business logic.
5. Strong sense of responsibility, good learning ability, communication skills, and self-motivation; able to respond and act quickly.
6. Good documentation habits; write and update work processes and technical documents promptly as required.

Bonus points: Applicants who meet any of the bonus points will be given priority.
1. Familiar with at least one mainstream machine learning framework (TensorFlow / PyTorch).
2. Experience in one of the following fields: AI Infrastructure, HW/SW Co-Design, High Performance Computing, ML Hardware Architecture (GPU, accelerators, networking).
3. Experience using or designing open-source training orchestration systems such as TFX.