AI/ML Infra Engineer - Hosting
Listed on 2026-06-25
IT/Tech
SRE/Site Reliability, IT Infrastructure, Systems Engineer
Ready to take the next step in your career?
Join a rapidly growing AI cloud infrastructure provider building high-performance compute platforms for large-scale AI training and inference workloads. With expanding GPU infrastructure across Europe and the United States, the organisation enables AI teams to access scalable compute environments without traditional infrastructure limitations.
As a Senior ML Infrastructure Engineer, the successful candidate will help build and scale Kubernetes-based machine learning platforms supporting large-scale training and inference systems. The role focuses on workload orchestration, GPU scheduling, inference optimisation, and distributed systems reliability, working alongside highly technical teams at the intersection of machine learning, cloud infrastructure, and high-performance computing.
If you would like to learn more about this opportunity, feel free to reach out and apply today!
Responsibilities
- Build and scale internal ML infrastructure platforms focused on AI training and inference workloads
- Develop systems for workload orchestration, job scheduling, and reliable execution across Kubernetes environments
- Improve and maintain inference infrastructure, including model packaging, deployment, and serving optimisation
- Collaborate with infrastructure and platform teams to maximise GPU utilisation, hardware performance, and operational reliability
- Design scalable systems and reusable platform capabilities that improve developer experience and operational efficiency
- Support CI/CD, GitOps, and infrastructure automation workflows across ML platform environments
- Troubleshoot GPU performance, distributed systems behaviour, networking, and storage bottlenecks
- Contribute to platform architecture discussions and long-term infrastructure scalability initiatives
Qualifications
- Strong ML engineering background with hands‑on experience supporting both training and inference infrastructure
- Experience with infrastructure engineering, platform engineering, or software engineering environments
- Strong programming skills in Python (Go experience is a plus)
- Deep experience with Kubernetes, including operators, CRDs, workload orchestration, and GPU scheduling
- Comfortable operating in Linux environments and debugging GPU‑related issues, including CUDA, drivers, networking, and file systems
- Strong systems thinking and ability to design scalable, reliable, distributed infrastructure
- Experience with CI/CD pipelines, GitOps workflows, and infrastructure automation
- Familiarity with orchestration and scheduling platforms such as Kueue, Flyte, Ray, or Slurm
- Experience with PyTorch or JAX environments
- Hands‑on experience deploying inference workloads using vLLM, SGLang, TensorRT‑LLM, or Triton
- Knowledge of GPU networking and performance optimisation, including InfiniBand, NVLink, and NCCL
- Experience working within HPC or large-scale distributed systems environments
Compensation
- Stock options
- $250,000 base salary