Principal Architect, System Software - Orbital Data Center
Santa Clara area, California, USA | Software Development

Space‑1 is NVIDIA’s first Orbital Data Center (ODC) module, a Vera Rubin–class compute platform engineered for low‑Earth orbit (LEO) missions. It is the first step in a multi‑generation orbital roadmap to speed up AI adoption. We are looking for a strong technical architect to own end‑to‑end system software architecture for Space‑1 and successor orbital platforms. You will architect the full stack, from applications and libraries through the data center stack, BMC and BIOS firmware, manageability, telemetry, host OS, GPU and CPU drivers, and CUDA, to deliver a production‑ready inference platform that operates reliably in the radiation, thermal‑cycling, and remote‑operations environment of LEO.

You will partner closely with the orbital hardware system architecture team, drive customer use cases with constellation operators, align architecture with mission requirements, and bring the best orbital AI products to market.

What You’ll Be Doing

Own the system architecture for the inference stack and other applications running on this class of products, and make it resilient to the faults encountered in orbit.
Co‑architect with the orbital hardware system architecture team to define interfaces, partitioning, and trade‑offs across silicon, board, firmware, OS, and AI workload layers for 5‑year LEO missions.
Own end‑to‑end system software architecture for Space‑1 and successor Orbital Data Center modules—covering data center stack, BMC firmware, BIOS, host OS, GPU/CPU drivers, CUDA, DCGM, and manageability telemetry as a single integrated stack.
Define the manageability architecture for an unreachable, autonomous data center: remote bring‑up, in‑orbit firmware update, dual‑module redundancy, fault containment, recovery from SEU/SEFI events, and telemetry for fleets ranging from tens to millions of nodes.
Architect rad‑tolerant system software behaviors—ECC handling, memory scrubbing, latch‑up mitigation, deterministic recovery, and graceful degradation through 5 years and up to ~8,000 thermal cycles in dawn–dusk sun‑synchronous orbit.
Drive Redfish, MCTP, PLDM, and constellation‑level management protocols across BMC, BIOS, and host software so customers can operate orbital fleets with the same tools they use on the ground.
Define minimum BMC feature set, pin budget, boot architecture (rugged M.2 / VPX‑class options), and dual‑module redundancy strategy in partnership with platform and mechanical engineering.
Partner with cloud and constellation customers (SpaceX, Blue Origin, Starcloud, Planet, Cowboy Space, and others) to translate mission requirements, including orbit, duty cycle, NSA PHIPs security, post‑quantum networking (CX9), and inference SLAs, into actionable platform software architecture.
Drive reliability and optimization in the system software architecture from an orbital data center viewpoint, including correct operation through eclipse periods and idle‑power retention strategies.
Work closely with the bring‑up team and resolve issues at Speed of Light from first silicon through first launch. Own quality, reliability, and telemetry performance of the system software delivered with each ODC module shipped to customers.

What We Need To See

15+ years of relevant experience in server/platform system software—spanning compute libraries, BMC firmware, BIOS, host OS, drivers, and manageability.
BS, MS, or PhD in EE/CS or a related field (or equivalent experience).
Working experience in building AI infrastructure and systems in space. Proven record of architecting and delivering platform software for large‑scale data centers or mission‑critical embedded systems.
Strong knowledge of server architecture, data center manageability, and full‑stack integration of firmware with OS and accelerator software. Hands‑on experience with data center health management workflows, telemetry, and fault management at scale.
Solid understanding of hardware management interfaces (USB, SMBus/I2C, PCIe) and proficiency with modern management protocols including Redfish, MCTP, and PLDM.
Strong and demonstrable skill in C/C++ and Python.
Experience programming and debugging server platforms, including pre‑silicon and platform bring‑up environments.
Experience in…
