Research Scientist/Engineer – Training Infrastructure
Listed on 2025-12-06
IT/Tech
Machine Learning/ML Engineer, Systems Engineer, AI Engineer, Cloud Computing
Location: Iowa
Research Scientist / Engineer – Training Infrastructure at Luma AI
Luma’s mission is to build multimodal AI to expand human imagination and capabilities. We believe that multimodality is critical for intelligence. To go beyond language models and build more aware, capable, and useful systems, the next step-function change will come from vision. We are working on training and scaling up multimodal foundation models for systems that can see and understand, show and explain, and eventually interact with our world to effect change.
We are looking for engineers with significant experience solving hard problems in PyTorch, CUDA, and distributed systems. You will work alongside the research team to build and train cutting‑edge foundation models, designed from the ground up to scale, on thousands of GPUs.
- Design, implement, and optimize efficient distributed training systems for models trained on thousands of GPUs
- Research and implement advanced parallelization techniques (FSDP, Tensor Parallel, Pipeline Parallel, Expert Parallel)
- Build monitoring, visualization, and debugging tools for large‑scale training runs
- Optimize training stability, convergence, and resource utilization across massive clusters
- Extensive experience with distributed PyTorch training and parallelism in foundation model training
- Deep understanding of GPU clusters, networking, and storage systems
- Familiarity with communication libraries (NCCL, MPI) and distributed system optimization
- (Preferred) Strong Linux systems administration and scripting capabilities
- (Preferred) Experience managing training runs across >100 GPUs
- (Preferred) Experience with containerization, orchestration, and cloud infrastructure
Mid‑Senior level
Employment Type: Full‑time
Job Function: Other
Industries: Software Development