Software Engineer: ML Infra
Listed on 2026-06-17
Software Development
Machine Learning/ ML Engineer
About the Role
Generalist trains very large robot foundation models. This requires running distributed training jobs and researcher experiments on very large numbers of latest-generation GPUs and supporting infrastructure (currently NVIDIA). Our storage and data loading requirements are extreme, and meeting them means getting the most out of cloud infrastructure and building custom solutions.
You will also own inference infrastructure. For our robots, this is a fleet of on-prem GPUs attached to the robots, with extreme real-time and latency requirements in compute-constrained environments.
You’ll be responsible for:
- Owning our GPU compute fleets
- Ensuring our GPUs are easy for researchers to use and maximally utilized
- Optimizing and improving ML data loading, transport, and storage in highly distributed, fully utilized environments
- Orchestrating robot inference fleets
You:
- Have managed large fleets of GPUs running large-scale, long-running, highly distributed training or inference workloads
- Have deep experience in Slurm or Kubernetes for ML workload orchestration
- Have built high-scale ML data loaders and preparation systems
- Deeply understand every layer of the ML hardware, storage, and networking stacks
- Have experience in the NVIDIA GPU ecosystem
We are an equal opportunity employer, and we do not discriminate on the basis of race, religion, color, national origin, sex, sexual orientation, age, veteran status, disability, genetic information, or other applicable legally protected characteristic.