Senior DGX Cloud Software Engineer - Infrastructure Automation and Distributed Systems
Listed on 2026-02-12
Software Development
Cloud Engineer - Software, DevOps, Software Engineer
We are seeking Software Engineers with previous experience building and running private and public clouds at production scale. As part of the DGX Cloud team, you’ll support our customers’ journeys in AI training and inference development by building platforms, tools, and services that defend the operational capacity of our bare‑metal accelerated compute infrastructure and codify reliability best practices in the broader DGX Cloud platform ecosystem.
What You’ll Be Doing
- Design, build, and run cloud infrastructure services to meet business goals, including integrations, migrations, bring‑ups, updates, and decommissions as necessary.
- Participate in the definition of internal service‑level objectives and error budgets as part of our observability strategy.
- Eliminate toil, or automate it where the return on investment justifies building and maintaining the automation.
- Practice sustainable, blameless incident prevention and response while participating in an on‑call rotation.
- Consult with peers on systems‑design best practices.
- Participate in a supportive culture of values‑driven introspection, communication, and self‑organization.
What We Need to See
- Proficiency in Python or Go.
- BS degree in Computer Science or a related technical field, or equivalent experience.
- 5+ years of relevant experience in infrastructure and fleet management engineering.
- Experience with infrastructure automation and distributed‑systems design, developing tools for large‑scale private or public cloud systems that require fully automated management in production.
- A track record of initiating projects, convincing collaborators, and contributing to projects initiated by others.
- In‑depth knowledge of Linux, Slurm, Kubernetes, distributed storage, and systems networking.
- A systematic problem‑solving approach, clear communication, ownership, and a focus on results (e.g., when making build/reuse/buy decisions).
Ways to Stand Out from the Crowd
- Experience with bare‑metal as a service, multi‑cloud infrastructure services, and teaching reliability engineering or other scale‑oriented cloud practices to peers.
- Experience with accelerated compute and communications technologies such as BlueField networking, InfiniBand topologies, NVMesh, and/or NCCL.
- Experience working with a centralized security organization to prioritize and mitigate security risks; prior ML/AI work is a plus.
We offer a competitive base salary based on location, experience, and peer benchmarks: $168,000 – $270,250 USD for Level 4 and $208,000 – $333,500 USD for Level 5, plus equity and benefits.
Applications are accepted until December 3, 2025. NVIDIA is committed to fostering a diverse work environment and is an equal‑opportunity employer. We do not discriminate on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status, or any other characteristic protected by law.