
Senior AI Infra Engineer - Model Inference Systems; Multimodal/LLM/VLM

Job in San Jose, Santa Clara County, California, 95111, USA
Listing for: TikTok
Full Time position
Listed on 2026-06-19
Job specializations:
  • IT/Tech
    AI Engineer (Applied/Software), Machine Learning/ ML Engineer
Job Description & How to Apply Below
Position: Senior AI Infra Engineer - Large Model Inference Systems (Multimodal/LLM/VLM)
About the Team
We are dedicated to building the inference infrastructure for ultra-large-scale language models, vision-language models, and frontier multimodal AI systems. Our mission is to provide a robust, scalable, and high-performance foundation for distributed serving, heterogeneous scheduling, and low-latency inference at massive scale. You will work on some of the most challenging problems in large-model online serving, spanning traffic orchestration, throughput and latency optimization, kernel efficiency, and production reliability for next-generation AI systems.

Responsibilities - What You'll Do

* Build and evolve next-generation inference systems for large-scale online traffic, including global scheduling across heterogeneous compute resources, high-concurrency load balancing, and efficient batch formation

* Optimize distributed inference for 200B+ models and complex multimodal models through tensor parallelism (TP), expert parallelism (EP), data parallelism (DP), and related strategies to improve throughput and latency in production

* Develop high-performance kernels for frontier model architectures such as MoE, emerging attention mechanisms, and multimodal fusion layers using CUDA, Triton, and related tools

* Explore AI-driven infrastructure for inference systems, including AI Agents for kernel optimization, performance tuning, consistency validation, deployment pipelines, and intelligent operations

Minimum Qualifications

* Bachelor's degree or above in Computer Science, Software Engineering, Artificial Intelligence, Mathematics, or related fields

* 4+ years of experience in high-performance computing, distributed scheduling systems, or large-model inference engine development

* Familiarity with large-model architectures and strong system design skills for complex, high-concurrency environments

* Strong understanding of asynchronous scheduling, resource pooling, and load balancing in distributed microservice systems

* Strong engineering skills in performance optimization and production system development

Preferred Qualifications

* Deep understanding of inference frameworks such as vLLM and SGLang, with hands-on experience in customization and production optimization

* Familiarity with GPU microarchitecture and operator-level optimization using CUDA, Triton, CUTLASS, or related tools

* Experience with LLM inference optimization, such as post-training quantization (PTQ), quantization-aware training (QAT), KV cache optimization, or prefill/decode (PD) disaggregation

* Experience deploying and optimizing VLMs or multimodal models in production
Position Requirements
10+ years of work experience