Senior AI Infra Engineer - Model Inference Systems; Multimodal/LLM/VLM Job, San Jose area, California, USA, IT/Tech

Position: Senior AI Infra Engineer - Large Model Inference Systems (Multimodal/LLM/VLM)
About the Team
We are dedicated to building the inference infrastructure for ultra-large-scale language models, vision-language models, and frontier multimodal AI systems. Our mission is to provide a robust, scalable, and high-performance foundation for distributed serving, heterogeneous scheduling, and low-latency inference at massive scale. You will work on some of the most challenging problems in large-model online serving, spanning traffic orchestration, throughput and latency optimization, kernel efficiency, and production reliability for next-generation AI systems.

Responsibilities - What You'll Do

* Build and evolve next-generation inference systems for large-scale online traffic, including global scheduling across heterogeneous compute resources, high-concurrency load balancing, and efficient batch formation

* Optimize distributed inference for models with 200B+ parameters and for complex multimodal models, using tensor parallelism (TP), expert parallelism (EP), data parallelism (DP), and related strategies to improve throughput and latency in production

* Develop high-performance kernels for frontier model architectures such as MoE, emerging attention mechanisms, and multimodal fusion layers using CUDA, Triton, and related tools

* Explore AI-driven infrastructure for inference systems, including AI agents for kernel optimization, performance tuning, consistency validation, deployment pipelines, and intelligent operations

Minimum Qualifications

* Bachelor's degree or above in Computer Science, Software Engineering, Artificial Intelligence, Mathematics, or related fields

* 4+ years of experience in high-performance computing, distributed scheduling systems, or large-model inference engine development

* Familiarity with large-model architectures and strong system design skills for complex, high-concurrency environments

* Strong understanding of asynchronous scheduling, resource pooling, and load balancing in distributed microservice systems

* Strong engineering skills in performance optimization and production system development

Preferred Qualifications

* Deep understanding of inference frameworks such as vLLM and SGLang, with hands-on experience in customization and production optimization

* Familiarity with GPU microarchitecture and operator-level optimization using CUDA, Triton, CUTLASS, or related tools

* Experience with LLM inference optimization, such as post-training quantization (PTQ), quantization-aware training (QAT), KV cache optimization, or prefill-decode (PD) disaggregation

* Experience deploying and optimizing VLMs or multimodal models in production
