System Software Engineer, Distributed Systems
Listed on 2026-06-02
Software Development
DevOps, Software Engineer, Cloud Engineer - Software, Backend Developer
The VLSI Productivity and Infrastructure team supports 1000+ chip design engineers by building tools and platforms that supercharge their everyday work. Our mission is to make chip designers faster.
We build and operate long shelf‑life systems spanning build automation, observability, analytics, automated error detection/remediation, and codebase modernization, with a strong commitment to stability. Our core workflow infrastructure runs as userspace software on bare‑metal Linux hosts (no sudo, no containers). We coordinate shared state and artifacts via NFS, launch long‑running, compute‑heavy workflows on IBM LSF, and provide adjacent services for APIs and observability.
This is a high‑ownership environment where you'll often be the expert on what you build.
- Design, build, and deliver core components of our next‑generation productivity platforms
- Develop reliable userspace infrastructure for long‑running engineering workflows at scale on bare‑metal Linux hosts
- Build state coordination over NFS without privileged operations (atomicity, idempotency/dedup, partial‑write recovery)
- Build and improve orchestration around IBM LSF (submission/tracking, retries/cancel, log capture, fairness/backpressure)
- Convert legacy codebases into modern powerhouses using incremental migration techniques (e.g., Perl to Go), with stage gates, parity strategies, and strong observability
- Debug and improve performance and reliability across Linux and Kubernetes, including operational tooling
- Collaborate with engineering users to turn ambiguous workflows into durable production systems
- B.S. CS/EE (or equivalent experience)
- 5+ years developing and operating production software in Go and/or Python, ideally in large codebases
- Strong Linux fundamentals: processes, file systems, permissions, synchronization/locks, concurrency, and debugging
- Solid distributed‑systems thinking: failures, retries/timeouts, backoff, idempotency, and operational rigor
- Experience building long‑runtime automation or services on shared compute clusters (batch schedulers, build systems)
- Ability to translate ambitious, high‑level goals into a safe delivery plan (instrumentation, staged rollout, measurable outcomes)
- Hands‑on experience with shared file systems at scale (NFS), or coordination patterns on eventually‑consistent storage
- Experience with batch job scheduling, shared compute fleets, or build systems
- Track record of incremental modernization (tests, shadow runs, canaries, rollback plans)
- Experience partitioning/optimizing metadata‑heavy systems and reducing I/O or R/W hot spots
- Strong incident/debug tactics: clear root‑cause analysis, remediation, and guardrails, plus the ability to quickly understand and take ownership of unfamiliar codebases in any language (including LLM‑generated code) and implement high‑leverage changes
With competitive salaries and a generous benefits package, we are widely considered to be one of the technology world’s most desirable employers.
Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 152,000 USD - 241,500 USD for Level 3, and 184,000 USD - 287,500 USD for Level 4.
You will also be eligible for equity and benefits.
NVIDIA is committed to fostering a diverse work environment and is an equal opportunity employer. We do not discriminate on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.