Senior Software Engineer, AI Networking
Santa Clara area, California, USA
Software Development

NVIDIA seeks a senior software engineer to join the AI Networking co‑design and benchmark R&D team. In this pivotal role, the candidate is responsible for building and productizing machine learning tools that use ML‑based combinatorial optimization and design space exploration (DSE) techniques to optimize AI workloads across large GPU and CPU clusters, ensuring the most efficient utilization of system resources at data center scale.

The role involves working on distributed deep learning, particularly within LLM training and inference stacks, and requires a strong passion for collective communication and networking. The candidate will interact with diverse hardware and platforms, such as Host Channel Adapters (HCAs), switches, CPUs, GPUs, and complete systems, and will engage across multiple software layers, including LLM applications, machine learning frameworks, and communication and computing libraries.

What you'll be doing:

Design and implement resource allocation and combinatorial optimization techniques (e.g., reinforcement learning, LLM agents for DSE, Bayesian optimization, and other multi‑objective optimization techniques) to optimize LLMs at data center scale.
Research, develop, and deploy AI/ML techniques to optimize large‑scale deep learning (LLM) training and inference on NVIDIA supercomputers and distributed systems, focusing on high‑performance networking and NVIDIA communication libraries.
Build and productize ML‑based tools for performance prediction and optimization, with a strong emphasis on networking aspects.
Develop and deploy a scalable, reliable data curation pipeline capable of handling complex data types, such as time series and PyTorch model graphs, to support the training of high‑performance machine learning models.
Collaborate across hardware and software teams to deliver valuable performance analysis insights.
Lead performance test planning, establish performance targets for new technologies and solutions, and drive efforts to achieve those performance goals.

What we need to see:

PhD or Master's degree in Computer Science, Software Engineering, or equivalent experience.
4+ years of experience applying machine learning techniques to computer architecture and system optimization problems. Experience leveraging ML at the intersection of at least two of the following areas is desired: HPC, networking, and AI applications.
Hands‑on experience developing and deploying various learning algorithms (e.g., reinforcement learning, offline RL, supervised learning) to tackle optimization challenges within computer architecture, system design, or networking domains.
Proficiency in building and using ML models with leading frameworks such as PyTorch, TensorFlow, or JAX.
Proven ability to apply GNN- or transformer‑based optimization to PyTorch model graphs and Kineto execution traces.
Expertise combining knowledge of NVIDIA GPUs, CUDA, and deep learning frameworks (TensorFlow/PyTorch) with networking concepts, including collective communication libraries (e.g., NCCL) and protocols (such as RoCE and RDMA).
Strong programming capabilities in Python, Bash, and C++.
A collaborative teammate with effective communication and interpersonal abilities.

Ways to stand out from the crowd:

In‑depth knowledge of and experience with machine learning and reinforcement learning and their frameworks.
Comprehensive understanding of computer architecture, system architecture and networking.
Extensive experience applying machine learning techniques such as GNNs or related graph‑based models.
Knowledge of the PyTorch, CUDA, and NCCL libraries.
Proven software engineering/development skills.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 152,000 USD – 241,500 USD for Level 3, and 184,000 USD – 287,500 USD for Level 4.

You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until April 10, 2026.

This posting is for an existing vacancy.

NVIDIA uses AI tools in its recruiting processes.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.
