AML Large Model Distributed Storage Optimization Engineer

0-2 years
2 months ago
Job Description

Responsibilities

1. Build a unified data storage format and reading engine that supports the needs of multiple businesses across different scenarios (low cost, high availability, high throughput, high performance, large capacity, sequential or random access).
2. For large model scenarios, build an efficient system for model parameter management, segmentation, and deduplication.
3. Handle the architectural complexity of multi-level (hierarchical) storage, including but not limited to GPU (video) memory, main memory, and external storage.
4. Follow the evolution of cutting-edge software and hardware architectures and experiment with them.
5. Optimize multiple subsystems for multiple goals: functionality, availability, and fault tolerance for the training part; data consistency, effectiveness, and bandwidth capacity for the system synchronization part.
6. Push selected index and storage structures to the extreme, for example lock-free and progressive data structures.

Qualifications

1. Proficient in C++ and Python programming in a Linux environment.
2. Master the principles of distributed systems; have participated in the design, development, maintenance, and continuous optimization of large-scale distributed systems; and be able to identify potential problems in large, complex distributed systems.
3. Have participated in system optimization similar to Parameter Server or in index structure optimization for data reading engines, or have experience using or optimizing large-scale distributed storage systems such as HDFS and PFS.
4. Have excellent logical analysis skills, be able to reasonably abstract and decompose business logic, and have a good teamwork spirit.
5. Have a strong sense of responsibility, along with good learning ability, communication skills, self-motivation, and execution.
6. Have good documentation habits, and promptly write and update work processes and technical documents as required.

Bonus points

1. Understand open source storage/engine projects such as Redis, RocksDB, and Presto, and understand common machine learning file storage formats such as Parquet, TFRecord, and IndexRecordIO.
2. Familiar with one of the mainstream machine learning frameworks (TensorFlow, PyTorch, or JAX).
3. Have experience in one of the following fields: Database Systems, Distributed Storage, AI Infrastructure, HW/SW Co-Design, High Performance Computing, ML Hardware Architecture (GPU, Accelerators, Networking), Machine Learning Frameworks.
4. Have an in-depth understanding of, and the ability to trace through, the Linux kernel and operating system.
5. Have an ACM/OI competition background.

Skills

C++
GPU
JAX
AI Infrastructure
ML Hardware Architecture
Accelerators
HW/SW Co-Design
RocksDB
About
Job Source: jobs.bytedance.com

ByteDance is a technology company operating a range of content platforms that inform, educate, entertain and inspire people across languages, cultures, and geographies.
Dedicated to building global platforms of creation and interaction, ByteDance now has a portfolio of applications available in over 150 markets and 75 languages, including TikTok, Helo, Vigo Video, Douyin, and Huoshan.
