Senior Site Reliability Engineering - Infrastructure
Listed on 2025-12-25
IT/Tech
Systems Engineer, SRE/Site Reliability, Cloud Computing, Network Engineer
Apply for the Senior Site Reliability Engineering – Infrastructure role at NVIDIA.
Site Reliability Engineering (SRE) at NVIDIA is an engineering discipline that designs, builds and maintains large‑scale production systems with high efficiency and availability by combining software and systems engineering practices. It demands knowledge across systems, networking, coding, databases, capacity management, continuous delivery and deployment, and open‑source cloud‑enabling technologies such as Kubernetes and OpenStack.
SRE at NVIDIA ensures that our internal and external GPU cloud services run with maximum reliability and uptime, while enabling developers to change existing systems through careful preparation and planning, with an eye on capacity, latency and performance. SRE is also a mindset and a set of engineering approaches focused on eliminating manual work through automation, tuning performance and improving the efficiency of production systems.
We tackle problems across a broad spectrum using a variety of tools and approaches, aiming to reduce reactive operational work, conduct blameless post‑mortems, and proactively identify potential outages to drive continuous improvement. Our culture celebrates diversity, curiosity, problem‑solving and openness, supporting self‑direction and mentoring to help engineers grow.
Responsibilities:
- Design, implement and support operational and reliability aspects of large‑scale Kubernetes clusters with a focus on performance at scale, real‑time monitoring, logging and alerting.
- Engage in and improve the entire lifecycle of services—from inception and design through deployment, operation and refinement.
- Support services before they go live through system‑design consulting, developing software tools, platforms and frameworks, capacity management and launch reviews.
- Maintain services once live by measuring and monitoring availability, latency and overall system health.
- Scale systems sustainably through automation and evolve them by advocating changes that improve reliability and velocity.
- Practice sustainable incident response and blameless post‑mortems.
- Be part of an on‑call rotation to support production systems.
Requirements:
- BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), or equivalent experience.
- 5+ years of experience with infrastructure automation, distributed systems design and building tools for running large‑scale private or public cloud systems in production.
- Experience in one or more of the following: Python, Go, Perl or Ruby.
- In‑depth knowledge of Linux, networking and containers.
- Interest in crafting, analyzing and fixing large‑scale distributed systems.
- A systematic problem‑solving approach, coupled with strong communication skills and a sense of ownership and drive.
- Ability to debug and optimize code and to automate routine tasks.
- Experience using or running large private and public cloud systems based on Kubernetes, OpenStack and Docker.
NVIDIA is widely considered one of the technology world’s most desirable employers, and our people are forward‑thinking and hard‑working. If you are creative, autonomous and love a challenge, we want to hear from you.
Seniority level:
Mid‑Senior level.
Employment type:
Full‑time.
Job function:
Computer Hardware Manufacturing, Software Development, and Computers and Electronics Manufacturing.