【Job Description】
Responsibilities include, but are not limited to:
- Architect and execute large-scale custom model training and fine-tuning jobs (SFT, RLHF) on multi-node, multi-GPU clusters.
- Optimize training throughput and memory efficiency using distributed training strategies (FSDP, DeepSpeed, Megatron-LM) and mixed-precision techniques (FP16/BF16).
- Design and develop autonomous AI agents capable of multi-step reasoning, planning, and tool execution to automate complex manufacturing workflows.
- Analyze and profile complex workloads (e.g., LLM training, rendering pipelines) to identify bottlenecks in compute, memory bandwidth, and latency.
- Write and optimize high-performance kernels using CUDA, HIP, or custom assembly (PTX/SASS) to unlock hardware capabilities.
- Collaborate with Hardware Architects to define features for next-generation GPUs based on workload characterization.
- Design and implement performance regression testing suites to catch degradations in drivers or compilers.
- Mentor junior engineers on parallel programming paradigms and optimization techniques.