Senior AI Infra Engineer - Model Training Infrastructure; LLM/VLM/Agent RL Job, San Jose area, California, USA, Engineering

Position: Senior AI Infra Engineer - Large Model Training Infrastructure (LLM/VLM/Agent RL)
About the Team
We are dedicated to building the training infrastructure for ultra-large-scale language models, vision-language models, and frontier agentic models. Our mission is to provide a robust, scalable, and high-performance foundation for post-training, multimodal learning, and reinforcement learning at the hundred-billion-parameter scale and beyond. You will work on some of the most challenging problems in large-model training systems, from multimodal data efficiency to convergence optimization for next-generation foundation models.

What You'll Do

* Build and evolve unified training infrastructure for large models across post-training workflows, modalities, and training paradigms

* Design and optimize distributed training strategies for 100B to 1T parameter models, including DP, TP, PP, EP, operator fusion, memory optimization, and cluster-level MFU improvement

* Develop training and evaluation systems for Reasoning RL and Agent RL, including benchmarks, harnesses, convergence optimization, and rollout efficiency

* Enable multimodal training across image, text, audio, and video, and support emerging architectures such as MoE and Linear Attention with correctness and convergence validation

Minimum Qualifications:

* Bachelor's degree or above in Computer Science, Software Engineering, Artificial Intelligence, Mathematics, or related fields

* 4+ years of experience in large-scale ML systems, training infrastructure, or performance optimization

* Strong programming skills in Python and C++

* Strong understanding of PyTorch and distributed training frameworks such as DeepSpeed, Megatron, and FSDP

* Experience with distributed training for ultra-large models and strong debugging skills in convergence and system bottlenecks

Preferred Qualifications:

* Experience with PPO, GRPO, or Agent RL

* Experience building large-model evaluation systems, agentic harnesses, or benchmarking infrastructure

* Familiarity with multimodal training, post-training systems, MoE, or Linear Attention

* Experience with training optimization for 100B+ parameter models
