Responsibilities
1. Design and develop large-scale pre-training data processing pipelines that provide stable, reliable, high-quality data processing for base model pre-training, including data sourcing, data capture and collection, and data analysis (OCR, images, web pages).
2. Design and develop a data platform for large-model pre-training that manages data lifecycle elements such as metadata, lineage, and storage; provides visualization and observability for pre-training data; and explores the engineering limits of data experimentation and data release.
3. Build data synthesis solutions and frameworks for models such as LLMs and VLMs to support data scaling and related work.
4. Based on the characteristics of large-model training data, abstract and develop an efficient, reliable data processing framework that improves the engineering efficiency of large-model algorithm engineers when processing data.
Qualifications
1. Familiarity with at least one programming language, such as Go, Python, or Java.
2. Bonus points for an in-depth understanding of big data technology, and for proficiency with tools such as Spark, Flink, Kafka, Hive, and HDFS.
3. Bonus points for experience developing and extensively using system platforms related to data center and machine learning.
4. Bonus points for an in-depth understanding of large-model technology and the product ecosystem.
5. Enthusiasm for technical challenges, the ability to think independently, curiosity, and the ability to learn quickly.