Senior Datacenter Resiliency Architect

Job in Santa Clara, Santa Clara County, California, 95053, USA
Listing for: TieTalent
Full Time position
Listed on 2025-12-01
Job specializations:
  • IT/Tech
    Systems Engineer, Hardware Engineer, Data Engineer, AI Engineer
Job Description & How to Apply Below

We are seeking a Senior Datacenter Resiliency (RAS) Architect to support the development and validation of GPU hardware and software resiliency features. You will be a key member of a team of innovators, challenging the status quo and pushing beyond boundaries, with impact on the industry’s leading Datacenter GPUs and SoCs powering AI and HPC products.

What you’ll be doing
  • Architect hardware and software resiliency features to improve system Reliability, Availability, Serviceability (RAS), and performance in the Datacenter.
  • Model and analyze RAS metrics (e.g., Failures in Time for permanent and transient errors, Availability from GPU to Rack to Datacenter); use models to identify gaps and drive RAS improvements.
  • Collaborate with architects, unit designers, and software engineers to ensure alignment of verification requirements.
  • Develop and implement comprehensive architecture verification test plans for resiliency features.
  • Execute the architecture test plan by developing test content and enabling, running, and debugging tests on architecture models; support test debug on RTL, emulation, and silicon.
  • Run simulations to analyze Architectural Vulnerability Factor and liveness of on-die memory, flip-flops, and latches.
  • Develop CUDA software diagnostics kernels to run on clusters of NVIDIA GPUs to identify hardware issues.
  • Develop and automate fault models to simulate various fault types (e.g., transient faults, stuck-at faults) in gate-level netlists, RTL, architectural models, silicon, and other environments.
What we need to see
  • Master’s or PhD in Computer Engineering, Electrical Engineering, or closely related field, or equivalent experience.
  • 5+ years of relevant experience.
  • Familiarity with GPU and networking architectures, computer architecture basics (caches, coherence, buses, DMA), and machine learning/deep learning concepts.
  • Strong knowledge and experience in GPU hardware architecture or RAS features, or both.
  • Proficiency in developing architecture models.
  • Scripting and automation with Python or similar; proficiency in C/C++.
  • Excellent interpersonal skills and ability to collaborate with on-site and remote teams; strong debugging and analytical skills; self-driven and results-oriented.
  • Experience with resiliency and datacenter RAS, or with Verilog/SystemVerilog RTL simulation and debugging, is a plus, as is the ability to set up test benches and integrate components.
  • Programming with CUDA is a plus.
Company/role notes

NVIDIA’s work spans high-performance computing and AI computing—roles involve building resilient, high-availability computing platforms for AI, HPC, and data center workloads. NVIDIA is an equal opportunity employer; we do not discriminate on protected characteristics.

Position Requirements
10+ Years work experience