Software Engineer, Hardware Health
Listed on 2026-06-29
IT/Tech
Systems Engineer, Unix/Linux, SRE/Site Reliability, IT Infrastructure
About the Team
The Hardware Health and Observability team owns the end-to-end health lifecycle of the company’s global compute fleet.
Our mission is to maximize healthy, usable compute across accelerator vendors, generations, cloud providers, and regions through reliable health signals, automated remediation, and scalable operational tooling.
We build the systems that observe, detect, remediate, and verify hardware issues across GPUs, CPUs, networking, and platform infrastructure, enabling frontier model training and inference workloads to run reliably. We are the last line of defense for the success of OAI’s production and research workloads.
About the Role
On the Hardware Health and Observability team, you’ll build critical infrastructure that keeps the company’s largest compute clusters healthy and operational, where even small numbers of unhealthy systems can impact large-scale training and inference workloads. This team focuses on minimizing downtime, improving fleet efficiency, and ensuring compute resources remain continuously available to researchers and product teams. Engineers on this team own problems end-to-end, from defining health signals and debugging failures to building automated remediation systems that operate across millions of GPUs globally.
Responsibilities
- Define and maintain health signals across GPUs, CPUs, networking, and platform infrastructure.
- Build and evolve health checks that detect, remediate, and verify failures at scale.
- Ensure critical health checks execute with minimal latency to maximize workload uptime.
- Investigate hardware failures and system-level issues across large-scale compute environments.
- Own node lifecycle workflows including drain, quarantine, repair, RMA, and return-to-service processes.
- Build automation and tooling that enables global cluster management with minimal manual intervention.
- Partner with workload, reliability, and provider teams to integrate health signals into training and inference systems.
Qualifications
- 7+ years of industry experience in software or infrastructure engineering.
- Strong proficiency with Python and shell scripting.
- Experience building large-scale distributed systems or infrastructure platforms.
- Comfort digging into noisy operational data using SQL, PromQL, or similar tooling.
- Experience building reproducible analyses and operational tooling.
- Strong systems debugging and operational instincts with an ownership mindset.
- Experience with low-level hardware systems and Linux tooling (e.g., PCIe, InfiniBand, RoCE, networking, power management, kernel performance tuning, FW/SW debugging).
- Experience operating or debugging large-scale GPU or accelerator clusters.
- Expertise in network operations, observability, or systems telemetry.
- Experience with automated remediation systems or fleet lifecycle management.
- Experience improving reliability, utilization, or workload uptime in distributed compute environments.
We are an equal opportunity employer, and we do not discriminate on the basis of race, religion, color, national origin, sex, sexual orientation, age, veteran status, disability, genetic information, or other applicable legally protected characteristic.