Lead Site Reliability Engineer - Infrastructure
Listed on 2026-05-30
Software Development
Role Overview
We are seeking a Lead Site Reliability Engineer (Infrastructure) to act as technical lead for our Infrastructure SRE team in a fast‑moving VSaaS engineering organization. In this role you will own the team’s technical direction and execution across reliability, scalability, and operability of our shared platform and production systems, combining hands‑on technical leadership with responsibility for team outcomes.
You will define SRE strategy and guide architecture across our GCP and Kubernetes ecosystem, setting standards for reliability, scalability, GitOps, and observability. You will also mentor senior and staff engineers, lead incident response and high‑impact operational work, and contribute hands‑on when needed.
With a system‑wide view of the platform, you will guide architectural decisions, surface non‑obvious risks, and drive long‑term improvements to system reliability and operability.
Working closely with product and platform teams, you will shape the developer experience and ensure engineering teams can ship with speed and confidence. You will set engineering standards and continuously evolve our GitOps and observability practices.
This role requires strong expertise in cloud infrastructure, distributed systems, and CI/CD, along with hands‑on experience in Golang and/or Python to support automation and long‑term system reliability.
Responsibilities
- Team Leadership & Execution Ownership: Own technical direction and execution of the Infrastructure SRE team. Translate platform goals into actionable plans, ensuring alignment on priorities, reliability outcomes, and operational excellence across production systems.
- Production Operations & Incident Management: Operate and evolve large‑scale distributed systems in production, proactively identifying failure modes and mitigating risk. Own day‑to‑day operations including monitoring, alerting, incident response, coordination, post‑incident analysis, and continuous improvement.
- Architecture, Standards & Platform Governance: Provide architectural leadership across platform and infrastructure changes, identifying scalability constraints, system design risks, and long‑term reliability gaps. Define and enforce engineering standards for GCP, Kubernetes, and ArgoCD, ensuring consistent, secure, GitOps‑based delivery.
- Reliability Engineering & Observability: Lead strategy for monitoring, alerting, and system observability, driving a shift from reactive incidents to proactive reliability engineering.
- Enablement, CI/CD & Collaboration: Guide CI/CD and cloud‑native delivery practices at scale to ensure safe, scalable releases. Mentor senior and staff engineers, conduct high‑impact design and code reviews (Golang/Python), and partner with product and engineering teams to embed system‑level thinking across development.
- Hands‑on Technical Contribution: Provide hands‑on technical contribution where needed, including debugging production issues, reviewing and contributing to code, and supporting critical incident resolution to ensure system reliability and team effectiveness.
Requirements
- Leadership & Experience: 10+ years of experience in Site Reliability Engineering, Platform Engineering, or Infrastructure Engineering, including demonstrated experience leading technical engineering teams, driving roadmaps, and owning delivery of large‑scale production systems.
- Cloud & Distributed Systems Expertise: Deep experience with cloud‑native architectures and distributed systems at scale, particularly in GCP and Kubernetes environments. Ability to reason about system design, identify failure modes, and evaluate scalability and reliability risks.
- GitOps & Delivery Engineering: Strong experience with GitOps‑based delivery workflows, particularly ArgoCD, and CI/CD pipeline design. Ability to ensure safe, repeatable, and observable production deployments.
- Infrastructure & Automation: Strong hands‑on background in infrastructure‑as‑code (Terraform preferred), automation, and operational tooling. Proficiency in Golang and/or Python for building and reviewing production systems. Strong Linux systems knowledge and production troubleshooting experience.
- Observa…