AI Inference Performance Engineer - College Grad
Job in Santa Clara, Santa Clara County, California, 95053, USA
Listed on 2026-06-18
Listing for: NVIDIA AI
Full Time position
Job specializations:
- Software Development: AI Engineer (Applied/Software), Machine Learning/ML Engineer
Job Description
What You Will Be Doing
- Drive industry benchmark results: own the end-to-end optimization pipeline, implementing and integrating optimizations in quantization, scheduling, memory management, and distributed inference across TensorRT-LLM, SGLang, and vLLM.
- Define and optimize cutting-edge workloads: identify and shape next-generation inference benchmarks for multi-turn coding, agentic workflows, and other emerging AI use cases. Collaborate with framework and kernel teams to push performance to its limits on large-scale LLM-MoE models, vision-language models, video diffusion models, recommendation, and speech workloads.
- Architect distributed inference: design and optimize execution from single‑GPU to rack‑scale clusters, managing performance across clusters of GPUs.
- Establish performance methodology: apply roofline analysis and systematic profiling to decompose bottlenecks across CUDA kernels, frameworks, and serving layers.
- Influence the ecosystem: contribute to TensorRT-LLM, vLLM, SGLang, and other open-source projects. Partner with architecture, kernel, and compiler teams to shape GPU roadmaps based on real workload data.
- Technical leadership: raise the technical bar for the team, drive cross‑functional execution on tight benchmark timelines, and lead a world‑class team.
Qualifications
- BS, MS, or PhD in Computer Science, Computer Engineering, Electrical Engineering, or equivalent experience.
- 2+ years of relevant software development experience.
- Strong Python or C++ programming, software design, and software engineering skills.
- Expertise with a DL framework such as PyTorch or JAX.
- Proven track record of delivering measurable performance improvements in deep learning inference or high‑performance systems.
- Deep understanding of LLM/VLM architectures and inference mechanics: attention, KV caching, batching strategies, decode‑phase bottlenecks, speculative decoding, disaggregated serving, etc.
- Prior experience with an LLM framework (TensorRT-LLM, vLLM, SGLang, etc.) or a DL compiler in inference, deployment, algorithms, or implementation.
- Prior experience with performance modeling, profiling, debugging, and code optimization of a DL/HPC/high‑performance application.
- Experience with scale‑out inference orchestration (MPI, NCCL, K8S) on large GPU clusters.
- Expertise in kernel development (CUTLASS, CuTe DSL, TileLang, OpenAI Triton) or compiler/runtime paths (torch.compile, graph lowering, operator fusion). Architectural knowledge of CPU, GPU, FPGA, or other DL accelerators; GPU programming experience (CUDA).
- Track record of leading ambiguous, high‑impact technical programs across multiple teams under tight deadlines.
Compensation & Benefits:
Base salary range is 124,000 USD – 195,500 USD for Level 2, and 152,000 USD – 241,500 USD for Level 3. Additional equity and benefits are available.
NVIDIA is committed to fostering an inclusive work environment and is an equal‑opportunity employer. We do not discriminate on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status, or any other characteristic protected by law.