AI/ML Infra Engineer - Hosting Job, San Francisco area, California, USA, IT/Tech

Ready to take the next step in your career?

Join a rapidly growing AI cloud infrastructure provider building high-performance compute platforms for large-scale AI training and inference workloads. With expanding GPU infrastructure across Europe and the United States, the organisation enables AI teams to access scalable compute environments without traditional infrastructure limitations.

As a Senior ML Infrastructure Engineer, the successful candidate will help build and scale Kubernetes-based machine learning platforms supporting large-scale training and inference systems. The role focuses on workload orchestration, GPU scheduling, inference optimisation, and distributed systems reliability, working alongside highly technical teams at the intersection of machine learning, cloud infrastructure, and high-performance computing.

If you would like to learn more about this opportunity, feel free to reach out and apply today!

Responsibilities

Build and scale internal ML infrastructure platforms focused on AI training and inference workloads
Develop systems for workload orchestration, job scheduling, and reliable execution across Kubernetes environments
Improve and maintain inference infrastructure, including model packaging, deployment, and serving optimisation
Collaborate with infrastructure and platform teams to maximise GPU utilisation, hardware performance, and operational reliability
Design scalable systems and reusable platform capabilities that improve developer experience and operational efficiency
Support CI/CD, GitOps, and infrastructure automation workflows across ML platform environments
Troubleshoot GPU performance, distributed systems behaviour, networking, and storage bottlenecks
Contribute to platform architecture discussions and long-term infrastructure scalability initiatives

Skills/Must Have

Strong ML engineering background with hands‑on experience supporting both training and inference infrastructure
Experience with infrastructure engineering, platform engineering, or software engineering environments
Strong programming skills in Python (Go experience is a plus)
Deep experience with Kubernetes, including operators, CRDs, workload orchestration, and GPU scheduling
Comfortable operating in Linux environments and debugging GPU‑related issues, including CUDA, drivers, networking, and file systems
Strong systems thinking and ability to design scalable, reliable, distributed infrastructure
Experience with CI/CD pipelines, GitOps workflows, and infrastructure automation

Desirable Skills

Familiarity with orchestration and scheduling platforms such as Kueue, Flyte, Ray, or Slurm
Experience with PyTorch or JAX environments
Hands‑on experience deploying inference workloads using vLLM, SGLang, TensorRT‑LLM, or Triton
Knowledge of GPU networking and performance optimisation, including InfiniBand, NVLink, and NCCL
Experience working within HPC or large-scale distributed systems environments

Benefits

Stock options

Salary

$250,000 base salary
