Senior System Architect, Infrastructure Reliability
Listed on 2026-06-19
NVIDIA is seeking a Senior System Architect, Heterogeneous EDA Systems, to solve a complex challenge in accelerated computing: failure attribution. As EDA workloads scale across thousands of heterogeneous nodes, a single failure can cause massive resource waste. We need an engineer to develop and build an automated framework that ingests telemetry from CPU and GPU clusters to identify the root cause of job failures in real time and distinguishes between hardware faults, infrastructure instability, and software defects.
- Architect Failure Attribution Frameworks: Build a scalable ‘flight recorder’ for EDA jobs that captures high‑fidelity state across the CPU, GPU, and Fabric at the moment of failure.
- Distributed Logging & Tracing: Implement low‑overhead tracing mechanisms (using tracing tools or custom agents) that provide access to job execution across multi‑node Slurm or Kubernetes clusters.
- Root Cause Automation: Develop heuristics and models based on machine learning to classify failures as ‘Hardware Fault’, ‘Software Bug’, or ‘Environment Issue’. This reduces the Mean Time to Identify (MTTI) for R&D teams.
- Resiliency Engineering: Work closely with hardware and infrastructure teams to define ‘signals of impending failure’, enabling proactive job migration or checkpointing before a crash occurs.
- Distributed Systems Mastery: BS, MS, or PhD in Computer Science or Electrical Engineering (or equivalent experience) with 6+ years in systems programming. Experience building automated RCA (Root Cause Analysis) pipelines for HPC or cloud‑scale environments.
- CPU Architecture Deep‑Dive: Expert knowledge of x86/ARM node‑level metrics: IPC (Instructions Per Cycle), cache contention, NUMA imbalance, and hardware interrupts.
- Programming Proficiency: Strong C++ and Python skills, with the ability to build high‑performance daemons that monitor system health without impacting workload performance.
- Scale Experience: Familiarity with cluster resource managers (Slurm, LSF, or Kubernetes) and how they manage job lifecycle and signal propagation.
- Low‑Level Diagnostics: Expert knowledge of the Linux kernel and its error‑reporting interfaces (/dev/mcelog, dmesg, journald). Understanding of how the kernel handles hardware exceptions and memory faults.
- GPU Infrastructure Proficiency: Deep experience with NVIDIA DCGM (Data Center GPU Manager) and the NVIDIA Management Library (NVML) for monitoring device health and capturing state dumps. Experience with tools that perform non‑intrusive monitoring of application health and syscall‑level failure patterns.
- Experience with checkpoint/restore technologies (like CRIU) and their application in long‑running EDA flows.
Base salary will be determined based on location, experience, and comparable positions: $184,000–$287,500 for Level 4, and $224,000–$356,500 for Level 5. You will also be eligible for equity and benefits.
NVIDIA is committed to fostering an inclusive work environment and is proud to be an equal‑opportunity employer. We do not discriminate on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status, or any other characteristic protected by law.