AI Inference Performance Engineer - College Grad Job, Santa Clara area, California, USA, Software Development

Position: AI Inference Performance Engineer - New College Grad 2026

What You Will Be Doing

Drive industry benchmark results: own the end-to-end optimization pipeline, implement and integrate optimizations in quantization, scheduling, memory management, and distributed inference across TensorRT-LLM, SGLang, and vLLM.
Define and optimize cutting-edge workloads: identify and shape next-generation inference benchmarks covering multi-turn coding, agentic workflows, and other emerging AI use cases. Collaborate with framework and kernel teams to push performance to its limits on large-scale LLM-MoE models, vision-language models, video diffusion models, recommendation, and speech workloads.
Architect distributed inference: design and optimize execution from a single GPU to rack‑scale clusters, managing performance across many GPUs.
Establish performance methodology: apply roofline analysis and systematic profiling to decompose bottlenecks across CUDA kernels, frameworks, and serving layers.
Influence the ecosystem: contribute to TensorRT-LLM, vLLM, SGLang, and other open-source projects. Partner with architecture, kernel, and compiler teams to shape GPU roadmaps based on real workload data.
Technical leadership: raise the technical bar for the team, drive cross‑functional execution on tight benchmark timelines, and lead a world‑class team.

What We Need To See

BS, MS, or PhD in Computer Science, Computer Engineering, Electrical Engineering, or equivalent experience.
2+ years of relevant software development experience.
Strong Python or C++ programming, software design, and software engineering skills.
Expertise with a DL framework such as PyTorch or JAX.
Proven track record of delivering measurable performance improvements in deep learning inference or high‑performance systems.
Deep understanding of LLM/VLM architectures and inference mechanics: attention, KV caching, batching strategies, decode‑phase bottlenecks, speculative decoding, disaggregated serving, etc.

Ways To Stand Out From The Crowd

Prior experience with an LLM framework (TensorRT-LLM, vLLM, SGLang, etc.) or a DL compiler in inference, deployment, algorithms, or implementation.
Prior experience with performance modeling, profiling, debugging, and code optimization of a DL/HPC/high‑performance application.
Experience with scale‑out inference orchestration (MPI, NCCL, K8S) on large GPU clusters.
Expertise in kernel development (CUTLASS, CuTe DSL, TileLang, OpenAI Triton) or compiler/runtime paths (torch.compile, graph lowering, operator fusion). Architectural knowledge of CPUs, GPUs, FPGAs, or other DL accelerators; GPU programming experience (CUDA).
Track record of leading ambiguous, high‑impact technical programs across multiple teams under tight deadlines.

Compensation & Benefits:
Base salary range is 124,000 USD – 195,500 USD for Level 2, and 152,000 USD – 241,500 USD for Level 3. Additional equity and benefits are available.

NVIDIA is committed to fostering an inclusive work environment and is an equal‑opportunity employer. We do not discriminate on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status, or any other characteristic protected by law.
