AI Inference Performance Engineer - College Grad Job, Santa Clara area, California, USA, Software Development

Position: AI Inference Performance Engineer - New College Grad 2026

Overview

We optimize and benchmark GenAI inference on NVIDIA's latest accelerators, defining performance standards across language models, video generation, and speech workloads. We work within TensorRT-LLM, SGLang, and vLLM, building tools that evaluate serving performance. This team sits at the intersection of GPU performance engineering and public accountability.

Responsibilities

Drive industry benchmark results: own the end-to-end optimization pipeline, and implement and integrate optimizations in quantization, scheduling, memory management, and distributed inference across TensorRT-LLM, SGLang, and vLLM.
Define and optimize cutting-edge workloads: identify and shape next-generation inference benchmarks for multi-turn coding, agentic workflows, and other emerging AI use cases. Collaborate with framework and kernel teams to push performance to its limits on large-scale LLM-MoE models, vision-language models, video diffusion models, recommendation, and speech workloads.
Architect distributed inference: design and optimize execution from a single GPU to rack-scale clusters, managing performance at every scale.
Establish performance methodology: apply roofline analysis and systematic profiling to decompose bottlenecks across CUDA kernels, frameworks, and serving layers.
Influence the ecosystem: contribute to TensorRT-LLM, vLLM, SGLang, and other open-source projects. Partner with architecture, kernel, and compiler teams to shape GPU roadmaps based on real workload data.
Technical leadership: raise the technical bar for the team, drive cross-functional execution on tight benchmark timelines, and lead a world-class team.
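
For readers unfamiliar with the roofline analysis mentioned above, the core idea can be sketched in a few lines of Python. This is a minimal illustration, not the team's actual tooling: the function names and the hardware numbers (1e15 FLOP/s peak compute, 2e12 bytes/s memory bandwidth) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def attainable_flops(peak_flops, peak_bandwidth, arithmetic_intensity):
    """Roofline bound on throughput in FLOP/s.

    arithmetic_intensity is FLOPs performed per byte moved from memory.
    The bound is the lower of the compute roof and the bandwidth roof.
    """
    return min(peak_flops, peak_bandwidth * arithmetic_intensity)


def classify(peak_flops, peak_bandwidth, arithmetic_intensity):
    """Label a kernel by which roof limits it."""
    # The ridge point is the intensity at which the two roofs meet.
    ridge = peak_flops / peak_bandwidth
    return "memory-bound" if arithmetic_intensity < ridge else "compute-bound"


# Hypothetical accelerator, for illustration only.
PEAK_FLOPS = 1e15        # FLOP/s (assumed)
PEAK_BANDWIDTH = 2e12    # bytes/s (assumed)

# A low-intensity kernel (about 1 FLOP per byte) versus a high-intensity one.
for intensity in (1.0, 1000.0):
    bound = attainable_flops(PEAK_FLOPS, PEAK_BANDWIDTH, intensity)
    label = classify(PEAK_FLOPS, PEAK_BANDWIDTH, intensity)
    print(f"intensity={intensity:g} FLOP/byte -> {bound:.3g} FLOP/s ({label})")
```

With these assumed numbers the ridge point is 500 FLOP/byte, so the low-intensity kernel is limited by memory bandwidth and the high-intensity kernel by peak compute. Profiling then focuses on whichever roof applies.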

Qualifications

BS, MS, or PhD in Computer Science, Computer Engineering, Electrical Engineering, or equivalent experience.
2+ years of relevant software development experience.
Strong Python or C++ programming, software design, and software engineering skills.
Expertise with a DL framework such as PyTorch or JAX.
Proven track record of delivering measurable performance improvements in deep learning inference or high-performance systems.
Deep understanding of LLM/VLM architectures and inference mechanics: attention, KV caching, batching strategies, decode-phase bottlenecks, speculative decoding, disaggregated serving, etc.
