Role Overview
We are seeking a highly skilled Infrastructure Backend Engineer to design, build, and maintain the scalable infrastructure that supports GMI's AI/ML initiatives. The ideal candidate will have a strong background in cloud computing, distributed systems, and DevOps practices to enable efficient AI infrastructure operations.
Preferred Location: Taipei or the US
Responsibilities
- Design, implement, and maintain AI/ML infrastructure optimized for large-scale training and inference.
- Develop automation pipelines for GPU/CPU resource provisioning and workload scheduling using DevOps best practices.
- Develop observability and telemetry solutions to proactively monitor hardware performance, utilization, and health, ensuring cluster reliability and efficiency.
- Optimize infrastructure for high-throughput data transfer and low-latency communication.
- Manage infrastructure security, access controls, and compliance standards for on-prem GPU cluster environments.
- Collaborate with relevant engineering teams to configure and troubleshoot GPU clusters and hardware resources.
- Document infrastructure architecture, deployment procedures, automation workflow, and operational best practices.
- Stay current with developments in GPU technology and infrastructure engineering, and integrate new hardware/software solutions as appropriate.
Qualifications
- Bachelor's degree in Computer Science or related field.
- Proficiency in at least one programming language (Golang, Python, Bash) with strong coding practices and system design skills.
- Extensive experience with infrastructure orchestration platforms, especially OpenStack and Kubernetes.
- Strong proficiency in automation, configuration management, and CI/CD pipelines using Ansible, Jenkins, GitLab CI, or similar.
- Proven experience implementing telemetry and observability solutions (e.g., Prometheus, Grafana, and related technologies).
- Knowledge of networking, security, and performance tuning in GPU clusters.
- Hands-on experience in deploying GPU clusters and managing GPU workloads.
- Experience with secret management using HashiCorp Vault.
- Strong system thinking and abstraction skills, capable of designing complex distributed systems from an end-to-end perspective.
- Strong understanding of DevOps principles, automation, and cloud-native architectures.
Meeting every qualification is not required; if you're excited about this role, we'd love to hear from you. We believe diverse perspectives and experiences strengthen our team.