Principal Software Engineer - Rack Scale Systems Infrastructure Job, Santa Clara area, California, USA, Software Development

At NVIDIA, as a Principal Rack Scale Systems Infrastructure Engineer, you will build and guide the development of the software systems that support our upcoming rack‑scale infrastructure products and services. This exceptional role sits where software meets hardware. You will work on control planes, state machines, orchestration systems, firmware, OS lifecycle, and networking fabrics. Your task is to build infrastructure‑as‑a‑service control plane software that turns complex rack‑scale hardware into dependable, manageable, and programmable infrastructure for NVIDIA, partners, and leading cloud and enterprise customers globally.

What You Will Be Doing

Define the complete software architecture for rack‑scale infrastructure products and services, covering control plane services, infrastructure management, firmware, operating systems, kernel drivers, networking fabrics, accelerator software, and user‑mode manageability software.
Use Kubernetes and cloud‑native primitives as an infrastructure fabric when appropriate, including controllers, operators, reconciliation loops, and open source components that can operate safely at rack and fleet scale. Build open source infrastructure software that can be adopted in different forms, including libraries, services, controllers, operators, and integration APIs for internal deployments and CSP environments.
Bridge hardware and software teams across firmware, BMC, BIOS, boot flows, OS images, drivers, networking, NVLink domains, InfiniBand, GPUs, DPUs, CPUs, and system management interfaces. Translate forward‑looking infrastructure roadmaps into formal software requirements, architecture specifications, and execution plans that align teams across the organization.
Partner directly with hyperscalers, CSPs, enterprise customers, internal component leads, vendors, and business partners to align infrastructure capabilities with real‑world deployment and integration needs. Establish reliability, security, validation, and shift‑left strategies that reduce risk before hardware reaches production environments.
Mentor senior engineers and technical leads, raising the engineering bar for large‑scale networked systems, foundational software, and rack‑scale control plane development.
Make high‑quality technical decisions in ambiguous environments, balancing customer needs, schedule, hardware realities, software maintainability, open source adoption, and long‑term infrastructure evolution.

What We Need To See

BS or MS in Computer Engineering, Computer Science, Electrical Engineering, or a related field, or equivalent experience. Proven experience (15+ years) in systems architecture, system software, distributed systems, infrastructure control planes, or infrastructure engineering.
Solid architectural knowledge of coordination frameworks, state machines, declarative APIs, reconciliation loops, lifecycle orchestration, failure handling, upgrade and rollback workflows, and distributed systems tradeoffs.
Practical coding skills in Go, C++, or Rust, including the ability to write, review, and direct production‑quality infrastructure software. Experience with Rust is highly valued.
Experience with Kubernetes or similar orchestration systems, especially as a fabric for managing infrastructure, hardware resources, or large‑scale infrastructure services. Experience with Linux‑based infrastructure software, OS rollout and image management, kernel or driver interactions, firmware lifecycle, and hardware bring‑up workflows.
Strong understanding of data center networking technologies and protocols, such as Ethernet, InfiniBand, RDMA, and fabric‑level manageability. Experience with complex accelerator‑based systems, including GPUs, DPUs, FPGAs, custom silicon, or other high‑performance computing systems.
Expertise in in‑band and out‑of‑band management architectures, including BMCs, Redfish, IPMI, and related system management protocols. Ability to work with security experts to define practical tradeoffs across secure boot, attestation, access control, update safety, serviceability, and ease of operation.
Experience crafting software intended for open source…