Distinguished Engineer, GPU Fleet Operations Automation

Job in Santa Clara, Santa Clara County, California, 95053, USA
Listing for: NVIDIA Gruppe
Full Time position
Listed on 2026-05-30
Job specializations:
  • IT/Tech
    Cloud Computing: Infrastructure & Operations, Systems Engineer
Salary/Wage Range or Industry Benchmark: 320,000 - 488,750 USD yearly
Job Description & How to Apply Below

Responsibilities

Lead the development of the DGX Cloud strategy for GPU fleet lifecycle, health, observability, utilization monitoring, and remediation across multiple environments (bare metal, public cloud, and neoclouds). Define and drive technical strategy, and implement auto‑remediation that detects and fixes faults, validates the fix, and restores critical systems. Collaborate with NVIDIA leadership, customers, infrastructure providers, and partners to deliver high‑availability accelerated computing infrastructure.

  • Define and drive the technical implementation of DGX Cloud operations practice for GPU fleet lifecycle.
  • Drive technical strategy and promote awareness of best practices within DGX Cloud engineering.
  • Guide technical delivery across all delivery environments: enterprise, public cloud, and high‑security, isolated, sovereign environments.
  • Collaborate with stakeholders to set industry standards for operational excellence.
  • Lead all technical aspects of planning and continuous evolution across a large technical scope, from ideation to full lifecycle management.
Qualifications
  • 15–18+ years in technical roles with a focus on operations and automation for cloud infrastructure, platforms, and applications.
  • 5–10+ years of leadership experience.
  • BS/MS or higher in systems/software engineering, or equivalent experience.
  • Technical proficiency in multi‑tenant data center and cloud‑native architectures: bare metal, virtualization, containerization, IaaS, Kubernetes, Slurm, AI/ML platforms.
  • Proven success delivering high‑impact technically complex solutions with transparency into resource utilization, performance, and operational insights.
  • Strong technical leadership: synthesizing multi‑functional needs into architecture and design while guiding execution across complementary teams.
  • Excellent communication and partnership skills, influencing peers, partners, and customers.
Preferred Experience
  • Application of AI to component‑level and system‑level issue identification and remediation.
  • Design, development, delivery, and operation of highly available scaled‑out systems in enterprise and cloud environments.
  • History of creating scalable processes and extensible systems for operations at scale.
  • Familiarity with open‑source ecosystems; experience collaborating on and influencing open‑source project governance.
Salary and Benefits

Base salary range: $320,000 USD to $488,750 USD, determined by location, experience, and market data. Eligible for equity and benefits.

Equal Opportunity Statement

We are committed to fostering a diverse work environment and proudly serve as an equal‑opportunity employer. We do not discriminate on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status, or any other characteristic protected by law.
