GMI Cloud

Infra Engineer - DevOps and Backend Engineering

This job is no longer accepting applications

  • Posted a month ago

Job Description

Role Overview

We are seeking a highly skilled Infrastructure and Backend Engineer to design, build, and maintain the scalable infrastructure that supports GMI's AI/ML initiatives. The ideal candidate will have a strong background in cloud computing, distributed systems, and DevOps practices to enable efficient AI infrastructure operations.

Preferred Location: Taipei, US

Responsibilities

  1. Design, implement, and maintain AI/ML infrastructure optimized for large-scale training and inference.
  2. Develop automation pipelines for GPU/CPU resource provisioning and workload scheduling using DevOps best practices and methodologies.
  3. Develop observability and telemetry solutions to proactively monitor hardware performance, utilization, and health to ensure cluster reliability and efficiency.
  4. Optimize infrastructure for high-throughput data transfer and low-latency communication.
  5. Manage infrastructure security, access controls, and compliance standards for on-prem GPU cluster environments.
  6. Collaborate with relevant engineering teams to configure and troubleshoot GPU clusters and hardware resources.
  7. Document infrastructure architecture, deployment procedures, automation workflows, and operational best practices.
  8. Stay current with the latest developments in GPU technology and infrastructure engineering, and integrate new hardware/software solutions as appropriate.

Qualifications

  1. Bachelor's degree in Computer Science or related field.
  2. Proficiency in at least one programming language (Golang, Python, Bash) with strong coding practices and system design skills.
  3. Extensive experience with infrastructure orchestration platforms, especially OpenStack and Kubernetes.
  4. Strong proficiency in automation, configuration management, and CI/CD pipelines using Ansible, Jenkins, GitLab CI, or similar.
  5. Proven experience implementing telemetry and observability solutions (e.g. Prometheus, Grafana and related technologies).
  6. Knowledge of networking, security, and performance tuning in GPU clusters.
  7. Hands-on experience in deploying GPU clusters and managing GPU workloads.
  8. Experience with secret management using HashiCorp Vault.
  9. Strong system thinking and abstraction skills, capable of designing complex distributed systems from an end-to-end perspective.
  10. Strong understanding of DevOps principles, automation, and cloud-native architectures.

Meeting every qualification is not required. If you're excited about this role, we'd love to hear from you. We believe diverse perspectives and experiences strengthen our team.

Job ID: 141439637