Site Reliability Engineer
Listed on 2026-06-24
IT/Tech
SRE/Site Reliability, Cloud Computing: Infrastructure & Operations, Unix/Linux, Systems Engineer
About the job
Site Reliability Engineering (SRE) combines software and systems engineering to build and run large‑scale, massively distributed, fault‑tolerant systems. SRE ensures that Google Cloud's services, both internally critical and externally visible, have reliability and uptime appropriate to customers' needs and a fast rate of improvement. SREs keep an ever‑watchful eye on our systems’ capacity and performance.
Much of our software development focuses on optimizing existing systems, building infrastructure and eliminating work through automation. On the SRE team, you’ll have the opportunity to manage the complex challenges of scale that are unique to Google Cloud, while using your expertise in coding, algorithms, complexity analysis and large‑scale system design. SRE’s culture of intellectual curiosity, problem solving and openness is key to its success.
In this role, you will drive the supportability and reliability of Woodshed and Napa, two key data intelligence systems underlying Google's AI push. Behind everything our users see online is the architecture built by the Technical Infrastructure team to keep it running. From developing and maintaining our data centers to building the next generation of Google platforms, we make Google's product portfolio possible.
Responsibilities
- Lead the team in our top 2026 challenge, reducing the support cost of the products via correct provisioning, intelligent alerting, system design and deployment improvements.
- Grow the SRE team from trained on‑callers and incident responders to system partners.
- Build trust with and influence over key stakeholders to drive successful scaling of the supportability of complex systems.
- Identify problems and pain points of the team, dev partner teams, and customers, and drive solutions that balance short‑term and long‑term needs.
- Work with critical customers to give them the reliability they need for their key user journeys.
Minimum qualifications
- Bachelor’s degree in Computer Science, a related technical field, or equivalent practical experience.
- 8 years of experience building and developing infrastructure or distributed systems.
- 5 years of experience troubleshooting and debugging.
- 5 years of experience building and architecting production‑quality Machine Learning (ML) systems.
- 5 years of experience programming in C++, Go, or Python.
Preferred qualifications
- Master’s degree in Computer Science, or a related technical field.
- Experience in Site Reliability Engineering.
- Experience in troubleshooting and supporting applications such as web services, data storage, databases, data pipelines, and commerce engines, on Linux/Unix or other operating systems.
US: $207,000 - $301,000 (USD) + 20% bonus target + equity + benefits.
Equal Employment Opportunity
Google is proud to be an equal opportunity and affirmative action employer. We are committed to building a workforce that is representative of the users we serve, creating a culture of belonging, and providing an equal employment opportunity regardless of race, creed, color, religion, gender, sexual orientation, gender identity/expression, national origin, disability, age, genetic information, veteran status, marital status, pregnancy or related condition (including breastfeeding), expecting or parents‑to‑be, criminal histories consistent with legal requirements, or any other basis protected by law.
See also Google's EEO Policy, Know your rights: workplace discrimination is illegal, Belonging at Google, and How we hire.