AI Benchmarking and Telemetry Engineer - NVIS
Listed on 2026-04-22
IT/Tech
AI Engineer (Applied/Software), Systems Engineer, Data Engineer
NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years. It’s a unique legacy of innovation that’s fueled by great technology—and amazing people. Today, we’re tapping into the unlimited potential of AI to define the next era of computing. An era in which our GPU acts as the brains of computers, robots, and self‑driving cars that can understand the world.
Doing what’s never been done before takes vision, innovation, and the world’s best talent. As an NVIDIAN, you’ll be immersed in a diverse, supportive environment where everyone is inspired to do their best work. Come join the team and see how you can make a lasting impact on the world.
- Design benchmarking methodologies for high‑performance computing and AI workloads, and execute them end to end on large‑scale GPU clusters.
- Develop and maintain telemetry infrastructure that captures host‑level GPU/CPU performance data, network fabric metrics, and power/thermal characteristics within the facility.
- Collaborate closely with hardware engineering, software development, and customer‑facing teams to define performance requirements, fix bottlenecks, and validate configurations against real‑world workloads.
- Deploy and manage observability stacks such as Prometheus, Grafana, NVIDIA’s DCGM, and custom telemetry solutions to provide actionable insights into cluster health, utilization, and performance trends.
- Work directly with engineering partners to understand performance requirements, conduct on‑site benchmarking engagements, and deliver detailed analysis and recommendations for workload optimization.
- Maintain extensive knowledge of industry‑standard benchmarks in advanced computing and machine learning (e.g., HPL, HPCG, MLPerf, NCCL) and contribute to developing new benchmarking methodologies for emerging workloads.
- Bachelor’s degree in Computer Science, Electrical Engineering, Computer Engineering, or a related field (or equivalent experience).
- 8+ years of direct experience working with HPC and/or AI infrastructure, including cluster deployment, performance analysis, and benchmarking.
- Deep expertise in Linux system administration, including kernel tuning, process scheduling, storage I/O optimization, and solving performance issues at scale.
- Proven experience crafting and implementing telemetry and monitoring solutions for large‑scale distributed systems (Prometheus, Grafana, DCGM, collectd, or similar).
- Solid grasp of GPU architectures, CUDA programming principles, and GPU performance traits in high‑performance computing and artificial intelligence workloads.
- Familiarity with job schedulers (Slurm, PBS, LSF) and container orchestration platforms (Kubernetes, Docker) in HPC/AI environments.
- Proficiency in Python, Bash, and other scripting languages for automation, data analysis, and workflow orchestration.
- Excellent analytical and problem‑solving skills with the ability to interpret complex performance data and communicate findings to both technical and non‑technical audiences.
- Experience with high‑performance networking technologies such as InfiniBand and RoCE, including Ethernet fabric tuning and performance analysis.
- Knowledge of parallel file systems (Lustre, GPFS, BeeGFS, Weka, VAST), including performance tuning and benchmarking.
- Background in power and thermal management for high‑density compute environments (PUE optimization, liquid cooling).
- Contributions to open‑source benchmarking tools or performance analysis frameworks.
- Industry certifications such as RHCE, CKA, or vendor‑specific HPC/data‑center credentials.
Competitive salaries and a comprehensive benefits package. Base salary range is $184,000 – $287,500 for Level 4 and $224,000 – $356,500 for Level 5. Eligible for equity and benefits.
NVIDIA uses AI tools in its recruiting processes.
NVIDIA is committed to fostering a diverse work environment and is proud to be an equal opportunity employer. We do not discriminate on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status, or any other characteristic protected by law.