Senior Site Reliability Engineering - Infrastructure Job Germany Ohio USA,IT/Tech

Location: Germany

Senior Site Reliability Engineering – Infrastructure

Apply for the Senior Site Reliability Engineering – Infrastructure role at NVIDIA.
Site Reliability Engineering (SRE) at NVIDIA is an engineering discipline that designs, builds and maintains large‑scale production systems with high efficiency and availability, combining software and systems engineering practices. It demands knowledge across systems, networking, coding, databases, capacity management, continuous delivery and deployment, and open‑source cloud‑enabling technologies such as Kubernetes and OpenStack.

SRE at NVIDIA ensures that our internal and external GPU cloud services run with maximum reliability and uptime, while enabling developers to change the existing system through careful preparation and planning, with an eye on capacity, latency and performance. SRE is also a mindset and a set of engineering approaches that focus on eliminating manual work through automation, tuning performance and improving the efficiency of production systems.

We tackle problems across a broad spectrum using a variety of tools and approaches, aiming to reduce reactive operational work, conduct blameless post‑mortems, and proactively identify potential outages to drive continuous improvement. Our culture celebrates diversity, curiosity, problem‑solving and openness, supporting self‑direction and mentoring to help engineers grow.

What You’ll Be Doing

Design, implement and support operational and reliability aspects of large‑scale Kubernetes clusters with a focus on performance at scale, real‑time monitoring, logging and alerting.
Engage in and improve the entire lifecycle of services—from inception and design through deployment, operation and refinement.
Support services before they go live through system‑design consulting, developing software tools, platforms and frameworks, capacity management and launch reviews.
Maintain services once live by measuring and monitoring availability, latency and overall system health.
Scale systems sustainably through automation and evolve them by advocating changes that improve reliability and velocity.
Practice sustainable incident response and blameless post‑mortems.
Be part of an on‑call rotation to support production systems.

What We Need To See

BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), or equivalent experience.
5+ years of experience with infrastructure automation, distributed systems design and building tools for running large‑scale private or public cloud systems in production.
Experience in one or more of the following: Python, Go, Perl or Ruby.
In‑depth knowledge of Linux, networking and containers.

Ways to Stand Out From the Crowd

Interest in crafting, analyzing and fixing large‑scale distributed systems.
A systematic problem‑solving approach, coupled with strong communication skills and a sense of ownership and drive.
Ability to debug, optimize code and automate routine tasks.
Experience using or running large private and public cloud systems based on Kubernetes, OpenStack and Docker.

NVIDIA is widely considered to be one of the technology world’s most desirable employers, and our teams are made up of forward‑thinking, hard‑working people. If you are creative, autonomous and love a challenge, we want to hear from you.

Seniority level: Mid‑Senior level.

Employment type: Full‑time.

Job function: Computer Hardware Manufacturing, Software Development, and Computers and Electronics Manufacturing.
