Senior System Software Engineer - GPU. Location: Santa Clara area, California, USA. Category: Software Development

Position: Senior System Software Engineer - GPU Performance

NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High Performance Computing and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. Our work opens up new universes to explore, enables amazing creativity and discovery, and powers what were once science fiction inventions from artificial intelligence to autonomous cars.

What You Will Be Doing

Conduct in-depth performance characterization and analysis on large multi-GPU and multi-node clusters.
Study the interaction of our libraries with all HW (GPU, CPU, Networking) and SW components in the stack.
Evaluate proofs of concept and conduct trade-off analysis when multiple solutions are available.
Triage and root‑cause performance issues reported by our customers.
Collect large volumes of performance data; build tools and infrastructure to visualize and analyze it.
Collaborate with a very dynamic team across multiple time zones.

What We Need To See

M.S. or Ph.D. in Computer Science or a related field (or equivalent experience), with relevant performance engineering and HPC experience.
3+ years of experience with parallel programming and at least one communication runtime (MPI, NCCL, UCX, NVSHMEM).
Experience conducting performance benchmarking and triage on large-scale HPC clusters.
Good understanding of computer system architecture, HW-SW interactions, and operating system principles (i.e., systems software fundamentals).
Ability to implement micro-benchmarks in C/C++ and to read and modify the code base when required.
Ability to debug performance issues across the entire HW/SW stack.
Proficiency in a scripting language, preferably Python.
Familiarity with containers, cloud provisioning, and scheduling tools (Kubernetes, SLURM, Ansible, Docker).
Adaptability and passion to learn new areas and tools. Flexibility to work and communicate effectively across different teams and time zones.

Ways To Stand Out From The Crowd

Practical experience with InfiniBand/Ethernet networks in areas such as RDMA, topologies, and congestion control.
Experience debugging network issues in large-scale deployments.
Familiarity with CUDA programming and/or GPUs.
Experience with deep learning frameworks such as PyTorch and TensorFlow.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 152,000 USD - 241,500 USD for Level 3, and 184,000 USD - 287,500 USD for Level 4. You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until April 14, 2026.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.
