System Software Engineer, Distributed Systems - Santa Clara area, California, USA - Software Development

The VLSI Productivity and Infrastructure team supports 1000+ chip design engineers by building tools and platforms that supercharge their everyday work. Our mission is to make chip designers faster.

We build and operate long shelf‑life systems spanning build automation, observability, analytics, automated error detection/remediation, and codebase modernization, with a strong commitment to stability. Our core workflow infrastructure runs as userspace software on bare‑metal Linux hosts (no sudo, no containers). We coordinate shared state and artifacts via NFS, launch long‑running, compute‑heavy workflows on IBM LSF, and provide adjacent services for APIs and observability.

This is a high‑ownership environment where you'll often be the expert on what you build.

What you will be doing:

Design, build, and deliver core components of our next‑generation productivity platforms
Develop reliable userspace infrastructure for long‑running engineering workflows at scale on bare‑metal Linux hosts
Build state coordination over NFS (atomicity, idempotency/dedup, partial‑write recovery, without privileged ops)
Build and improve orchestration around IBM LSF (submission/tracking, retries/cancellation, log capture, fairness/backpressure)
Convert legacy codebases into modern powerhouses using incremental migration techniques (e.g., Perl to Go), with stage gates, parity strategies, and strong observability
Debug and improve performance and reliability across Linux and Kubernetes, including operational tooling
Collaborate with engineering users to turn ambiguous workflows into durable production systems

What we need to see:

B.S. CS/EE (or equivalent experience)
5+ years developing and operating production software in Go and/or Python, ideally in large codebases
Strong Linux fundamentals: processes, file systems, permissions, synchronization/locks, concurrency, and debugging
Solid distributed‑systems thinking: failures, retries/timeouts, backoff, idempotency, and operational rigor
Experience building long‑runtime automation or services on shared compute clusters (batch schedulers, build systems)
Ability to translate ambitious, high‑level goals into a safe delivery plan (instrumentation, staged rollout, measurable outcomes)

Ways to stand out from the crowd:

Hands‑on experience with shared file systems at scale (NFS), or coordination patterns on eventually‑consistent storage
Experience with batch job scheduling, shared compute fleets, or build systems
Track record of incremental modernization (tests, shadow runs, canaries, rollback plans)
Experience partitioning/optimizing metadata‑heavy systems and reducing I/O or R/W hot spots
Strong incident/debug tactics (clear root‑cause analysis, remediation, and guardrails), plus rapid comprehension and ownership of unfamiliar codebases in any language (including LLM‑generated code) to implement high‑leverage changes

With competitive salaries and a generous benefits package, we are widely considered to be one of the technology world’s most desirable employers.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 152,000 USD - 241,500 USD for Level 3, and 184,000 USD - 287,500 USD for Level 4.

You will also be eligible for equity and benefits.

NVIDIA is committed to fostering a diverse work environment and is an equal opportunity employer. We do not discriminate on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.
