Senior System Architect, Infrastructure Reliability Job, Santa Clara area, California, USA, IT/Tech

NVIDIA is seeking a Senior System Architect, Heterogeneous EDA Systems, to solve a complex challenge in accelerated computing: failure attribution. As EDA workloads scale across thousands of heterogeneous nodes, a single failure can cause massive resource waste. We need an engineer to develop and build an automated framework that ingests telemetry from CPU and GPU clusters to identify the root cause of job failures in real time and distinguishes between hardware faults, infrastructure instability, and software defects.

What you’ll be doing:

Architect Failure Attribution Frameworks:
Build a scalable ‘flight recorder’ for EDA jobs that captures high‑fidelity state across the CPU, GPU, and Fabric at the moment of failure.
Distributed Logging & Tracing:
Implement low‑overhead tracing mechanisms (using tracing tools or custom agents) that provide visibility into job execution across multi‑node Slurm or Kubernetes clusters.
Root Cause Automation:
Develop heuristics and models based on machine learning to classify failures as ‘Hardware Fault’, ‘Software Bug’, or ‘Environment Issue’. This reduces the Mean Time to Identify (MTTI) for R&D teams.
Resiliency Engineering:
Work closely with hardware and infrastructure teams to define ‘signals of impending failure’, enabling proactive job migration or checkpointing before a crash occurs.

What we need to see:

Distributed Systems Mastery:
BS, MS, or PhD in Computer Science or Electrical Engineering (or equivalent experience) with 6+ years in systems programming. Experience building automated RCA (Root Cause Analysis) pipelines for HPC or cloud‑scale environments.
CPU Architecture Deep‑Dive:
Expert knowledge of x86/ARM node‑level metrics: IPC (Instructions Per Cycle), cache contention, NUMA imbalance, and hardware interrupts.
Programming Proficiency:
Strong C++ and Python skills, with the ability to build high‑performance daemons that monitor system health without impacting workload performance.
Scale Experience:
Familiarity with cluster resource managers (Slurm, LSF, or Kubernetes) and how they manage job lifecycle and signal propagation.

Ways to stand out from the crowd:

Low‑Level Diagnostics:
Expert knowledge of the Linux kernel and its error‑reporting interfaces (/dev/mcelog, dmesg, journald). Understand how the kernel handles hardware exceptions and memory faults.
GPU Infrastructure Proficiency:
Deep experience with NVIDIA DCGM (Data Center GPU Manager) and the NVIDIA Management Library (NVML) for monitoring device health and capturing state dumps. Experience with tools for non‑intrusive monitoring of application health and syscall‑level failure patterns.
Experience with checkpoint/restore technologies (like CRIU) and their application in long‑running EDA flows.

Compensation & Benefits

Base salary will be determined based on location, experience, and comparable positions: $184,000–$287,500 for Level 4, and $224,000–$356,500 for Level 5. You will also be eligible for equity and benefits.

NVIDIA is committed to fostering an inclusive work environment and is proud to be an equal‑opportunity employer. We do not discriminate on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status, or any other characteristic protected by law.
