Director, Site Reliability and Software Engineering - DGX Cloud

Job in Santa Clara, Santa Clara County, California, 95053, USA
Listing for: NVIDIA Corporation
Full Time position
Listed on 2026-06-09
Job specializations:
  • IT/Tech
    Cloud Computing: Infrastructure & Operations, Systems Engineer
Salary/Wage Range or Industry Benchmark: 80000 - 100000 USD Yearly
Job Description & How to Apply Below
Locations: US, CA, Santa Clara; US, Remote
Time type: Full time
Posted on: Posted Today
Job requisition: JR2017420

NVIDIA's invention of the GPU ignited modern AI, the next era of computing, with the GPU acting as the brain of computers, robots, and self-driving cars that can perceive and understand the world. Today, we are increasingly known as “the AI computing company”. We are looking to grow our company and our teams with the smartest people in the world.

We are looking for you.

NVIDIA GPUs are a hit in the market for deep learning, which is used in the research community and in industry to help solve many big data problems such as computer vision, speech recognition and translation, life science, image recognition, and natural language processing. NVIDIA GPU Cloud (NGC) is a GPU-accelerated platform that runs everywhere. Data scientists and researchers can now rapidly build, train, and deploy neural network models to address some of the most complicated AI challenges.

In this environment, the NVIDIA GPU Cloud computing team is looking for leaders to work on a world-class deep learning platform.

**What you'll be doing:**

As a Site Reliability and Software Engineering leader in the DGXC Cloud Reliability organization, you will manage the software, automation, and operations of the multi-colo distributed NVIDIA GPU cloud clusters and contribute to product strategy. You will lead all aspects of cluster automation and operational excellence planning, and grow your team. You thrive in a fast-paced iterative engineering environment and have experience delivering scalable distributed systems.

Most importantly, you will have a track record of earning the respect of past teams and cross-functional partners as both a technical leader and a manager, and you will be able to work via influence rather than direct authority when needed. The NVIDIA GPU Cloud Computing team works with customers across the entire company, and the ability to work across multiple levels of technical and organizational leadership is critical.

Operating with scale and speed, our world-class software engineers are just getting started, and as a leader, you guide the way to solve reliability for both our internally critical and our externally visible systems.
* Manage a team of Software and Site Reliability engineers, including program development, task planning and code reviews.
* Define team strategy and roadmap, and drive adoption of scalable SDLC practices, test infrastructure, and modern practices in NVIDIA's DGX Cloud Computing environment.
* Drive technical projects and provide leadership in an innovative and fast-paced environment.
* Be responsible for the overall planning, tracking and success of technical projects.
* Work closely with project and product management teams to ensure best-in-class product development.
* Contribute technically to projects for DGX Cloud Computing Services.
* Interact with key internal stakeholders to provide operational and financial clarity on technical spend.
* Drive decision making, visibility, and operational rigor across business analytics initiatives such as budget, project, and portfolio reporting. Lead efforts related to executive reporting, dashboards, and operational CTO metrics, focusing on continuous improvement to maximize decision making and executive visibility.
**What we need to see:**

* 12+ overall years of experience in engineering management. 5+ years of leadership.
* Bachelor's or Master's degree in Computer Science, or equivalent experience.
* Experience in designing and implementing large-scale distributed systems. Experience in containers, virtualization environments, and cluster solutions. Experience in managing Technical Support / DevOps teams. Set appropriate bars for technical excellence and deliver projects on tight deadlines.
* Strong knowledge in Unix/Linux.
* Experience implementing tools, processes, internal instrumentation, and methodologies, and resolving blockers.
* Demonstrated people management and leadership skills, the proven track…