Senior AI Infra Engineer - Model Training Infrastructure; LLM/VLM/Agent RL

Job in San Jose, Santa Clara County, California, 95111, USA
Listing for: TikTok
Apprenticeship/Internship position
Listed on 2026-06-19
Job specializations:
  • Engineering
    AI Engineer (Applied/Software)
Job Description & How to Apply Below
Position: Senior AI Infra Engineer - Large Model Training Infrastructure (LLM/VLM/Agent RL)
About the Team
We are dedicated to building the training infrastructure for ultra-large-scale language models, vision-language models, and frontier agentic models. Our mission is to provide a robust, scalable, and high-performance foundation for post-training, multimodal learning, and reinforcement learning at the hundred-billion-parameter scale and beyond. You will work on some of the most challenging problems in large-model training systems, from multimodal data efficiency to convergence optimization for next-generation foundation models.

What You'll Do

* Build and evolve unified training infrastructure for large models across post-training workflows, modalities, and training paradigms

* Design and optimize distributed training strategies for 100B to 1T parameter models, including DP, TP, PP, EP, operator fusion, memory optimization, and cluster-level MFU improvement

* Develop training and evaluation systems for Reasoning RL and Agent RL, including benchmarks, harnesses, convergence optimization, and rollout efficiency

* Enable multimodal training across image, text, audio, and video, and support emerging architectures such as MoE and Linear Attention with correctness and convergence validation

Minimum Qualifications:

* Bachelor's degree or above in Computer Science, Software Engineering, Artificial Intelligence, Mathematics, or related fields

* 4+ years of experience in large-scale ML systems, training infrastructure, or performance optimization

* Strong programming skills in Python and C++

* Strong understanding of PyTorch and distributed training frameworks such as DeepSpeed, Megatron, and FSDP

* Experience with distributed training for ultra-large models and strong debugging skills in convergence and system bottlenecks

Preferred Qualifications:

* Experience with PPO, GRPO, or Agent RL

* Experience building large-model evaluation systems, agentic harnesses, or benchmarking infrastructure

* Familiarity with multimodal training, post-training systems, MoE, or Linear Attention

* Experience with training optimization for 100B+ parameter models is a plus
Position Requirements
10+ years of work experience