Research Scientist: Post-Training
Listed on 2026-05-30
Staff / Principal ML Training Systems Engineer
We are building next-generation intelligent systems capable of operating in complex, real-world environments. Our team develops the full stack — from high-performance hardware and distributed systems infrastructure to large-scale multimodal foundation models powering autonomous decision-making.
Backed by significant funding and operating at the intersection of AI, systems engineering, and large-scale compute infrastructure, we are investing heavily in research, infrastructure, and scalable training systems to push the frontier of embodied intelligence.
We are seeking a Staff / Principal ML Training Systems Engineer to lead training systems performance across large-scale multimodal AI workloads. This is a core systems engineering role focused on scalability, efficiency, and correctness at massive GPU scale. Your work will directly impact infrastructure utilization, training throughput, and research iteration speed.
What You’ll Do
Own Training Performance End-to-End
- Diagnose and optimize performance for large-scale multimodal training workloads involving vision, video, language, sensor data, and sequential decision-making
- Build systematic performance attribution tooling, including:
  - Compute vs. communication analysis
  - Scaling curve analysis across cluster sizes
  - Bottleneck identification and prioritization
- Improve distributed training efficiency through:
  - Communication/computation overlap
  - Topology-aware workload placement
  - Parallelism optimization strategies
- Improve compute efficiency through:
  - Operator fusion
  - Attention optimization
  - Runtime and framework overhead reduction
- Improve memory efficiency through:
  - Sequence packing and bucketing
  - Memory fragmentation reduction
- Define and optimize data, tensor, pipeline, sharded, and hybrid parallelism strategies
- Improve execution efficiency through:
  - Communication scheduling and overlap
  - Graph capture and execution optimization
  - Runtime-level improvements
- Extend and improve internal training frameworks where necessary
- Establish source-of-truth performance metrics, including:
  - Step-time breakdowns
  - Throughput and scaling efficiency
- Build tooling to:
  - Compare scaling behavior across model families and cluster configurations
  - Track performance regressions over time
- Develop automated benchmarking and regression detection systems
- Collaborate directly with research scientists and ML engineers in a highly integrated environment
- Translate novel model architectures and research ideas into scalable, production-ready implementations
- Advise on training tradeoffs involving:
  - Long-horizon sequence modeling
  - Multimodal and variable-length data
  - Evaluation cadence and rollout efficiency
- Work with infrastructure and reliability teams to optimize utilization across large distributed workloads
- Analyze the impact of networking, collectives, and cluster topology on training efficiency
- Improve topology-aware scheduling and large-scale scaling behavior
- Deep hands-on experience with modern ML frameworks (PyTorch required; JAX is a plus)
- Strong understanding of:
  - Data, tensor, and pipeline parallelism
  - FSDP / ZeRO-style sharded training
  - Communication overlap strategies
  - Large-scale GPU cluster scaling behavior
- Strong systems intuition across compute, communication, and memory bottlenecks
- Exceptional debugging and performance analysis skills
- High ownership mindset and comfort operating in fast-moving, highly technical environments
- GPU kernel or compiler-level optimization experience (CUDA, Triton, graph capture, operator fusion)
- Experience with multimodal or video training involving variable-length sequences and packing strategies
- Experience building or extending distributed training frameworks and runtimes
- Familiarity with cluster networking, topology-aware scheduling, and large-scale infrastructure effects
- Direct impact on research velocity — every efficiency improvement accelerates model development across the organization
- Opportunity to shape the scalability and performance of next-generation multimodal training systems
- High-leverage engineering work with compounding impact across all training workloads
- Small, highly technical team with significant ownership and autonomy
We are a research-driven AI company focused on building scalable intelligent systems capable of robust operation in dynamic environments. By combining advances in machine learning, distributed systems, and infrastructure engineering, we aim to push the frontier of large-scale AI systems.
We are committed to building an inclusive and diverse workplace and encourage applicants from all backgrounds to apply.