System Software Engineer, Distributed Systems | Santa Clara area, California, USA | Software Development

## System Software Engineer, Distributed Systems

* Locations: US, CA, Santa Clara
* Time type: Full time
* Posted on: Today
* Job requisition: JR2013472

NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years. It’s a unique legacy of innovation that’s fueled by great technology—and amazing people. Today, we’re tapping into the unlimited potential of AI to define the next era of computing. An era in which our GPU acts as the brains of computers, robots, and self-driving cars that can understand the world.

Doing what’s never been done before takes vision, innovation, and the world’s best talent. As an NVIDIAN, you’ll be immersed in a diverse, supportive environment where everyone is inspired to do their best work. Come join the team and see how you can make a lasting impact on the world.

The VLSI Productivity and Infrastructure team supports 1000+ chip design engineers by building tools and platforms that supercharge their everyday work. Our mission: make chip designers faster. We build and operate long shelf-life systems spanning build automation, observability, analytics, automated error detection/remediation, and codebase modernization—with a strong commitment to stability. Our core workflow infrastructure runs as userspace software on bare-metal Linux hosts (no sudo, no containers).

We coordinate shared state and artifacts via NFS, launch long-running, compute-heavy workflows on IBM LSF, and provide adjacent services for APIs and observability. This is a high-ownership environment where you'll often be the expert on what you build. We are looking for a pragmatic and versatile systems engineer who enjoys working near the metal and building tools that empower other engineers.

This is a generalist role with an emphasis on distributed systems and operational excellence in a “below containers” world: coordination, reliability, performance, and safe evolution of legacy systems (including incremental modernization of large codebases into Go). This isn't a CI/CD pipeline configuration role; you will be writing the userspace software that manages state, concurrency, and reliability at scale.

## **What you will be doing:**

* Design, build, and deliver core components of our next-generation productivity platforms
* Develop reliable userspace infrastructure for long-running engineering workflows at scale on bare-metal Linux hosts
* Build state coordination over NFS (atomicity, idempotency/dedup, partial-write recovery, without privileged ops)
* Build and improve orchestration around IBM LSF (submission/tracking, retries/cancel, log capture, fairness/back pressure)
* Convert legacy codebases into modern powerhouses using incremental migration techniques (e.g., Perl to Go), with stage gates, parity strategies, and strong observability
* Debug and improve performance and reliability across Linux and Kubernetes, including operational tooling
* Collaborate with engineering users to turn ambiguous workflows into durable production systems

## **What we need to see:**

* B.S. CS/EE (or equivalent experience)
* 5+ years developing and operating production software in Go and/or Python, ideally in large codebases
* Strong Linux fundamentals: processes, file systems, permissions, synchronization/locks, concurrency, and debugging
* Solid distributed-systems thinking: failures, retries/timeouts, backoff, idempotency, and operational rigor
* Experience building long-runtime automation or services on shared compute clusters (batch schedulers, build systems)
* Ability to translate ambitious, high-level goals into a safe delivery plan (instrumentation, staged rollout, measurable outcomes)

**Ways to stand out from the crowd:**

* Hands-on experience with shared file systems at scale (NFS), or coordination patterns on eventually-consistent storage
* Experience with batch job scheduling, shared compute fleets, or build systems
* Track record of incremental modernization (tests, shadow runs, canaries, rollback plans)
* Experience partitioning/optimizing metadata-heavy systems and reducing I/O or R/W hot spots
* Strong incident/debug tactics:…

