Senior Manager, Site Reliability Engineering (SRE)
Listed on 2025-12-19
IT/Tech
Systems Engineer, Cloud Computing, IT Support, SRE/Site Reliability
The Senior Manager, Site Reliability Engineering (SRE) will lead the SRE organization to deliver reliable, scalable, and resilient platforms and services. This role will own the strategy, implementation, and continuous improvement of a unified observability platform that provides end-to-end visibility into infrastructure, applications, APM, and databases, enabling proactive issue detection, faster incident resolution, and improved customer experience.
The Senior Manager will drive practices around SLIs, SLOs, SLAs, and Error Budgets, embedding reliability into the engineering culture. They will oversee incident management, RCA, proactive alerting, predictive analysis, and automation, while ensuring close collaboration with engineering, product, and platform teams.
Key Responsibilities
Leadership & Team Management
- Hire, lead, and mentor a high-performing SRE team across geographies.
- Define and execute the SRE vision, roadmap, and strategy in alignment with business and engineering objectives.
- Establish a healthy 24x7 on-call model, ensuring coverage while promoting team well-being.
- Drive a blameless culture through structured postmortems and RCA follow-up actions.
- Build and manage a unified observability platform leveraging tools such as New Relic, Datadog, CloudWatch, Prometheus, Grafana, Graylog, and OpenTelemetry.
- Deliver holistic monitoring across infrastructure, applications, databases, APIs, and end-user experience.
- Implement APM (Application Performance Monitoring) to trace performance across distributed systems.
- Establish dashboards, metrics, and proactive alerting to identify anomalies early.
- Drive adoption of AIOps and predictive analytics for proactive reliability improvements.
- Define and manage SLIs, SLOs, SLAs, and Error Budgets across services.
- Partner with engineering teams to balance velocity with reliability, ensuring adherence to Error Budgets.
- Reduce MTTD (Mean Time to Detect) and MTTR (Mean Time to Resolve) through automation, faster detection, and better instrumentation.
- Perform capacity planning, scalability reviews, and resiliency testing.
- Lead major incident response, coordinating communications with executives and stakeholders.
- Drive root cause analysis (RCA) and implement long-term fixes.
- Partner with ITSM teams to align with incident, problem, and change management processes.
- Ensure continuous improvement loops from incidents back into observability, automation, and engineering practices.
- Collaborate with Engineering, Product, Security, Cloud, and DevOps teams to embed SRE practices.
- Provide guidance on instrumentation, reliability design, and operational readiness for new services.
- Partner with DBAs and data platform teams to monitor database health, replication, query performance, and failover readiness.
- Champion reliability as a shared responsibility across development and operations.
- 12+ years of experience in SRE, Operations, or Infrastructure Engineering, with 5+ years in leadership roles.
- Proven expertise in unified observability, monitoring, and alerting across infra, apps, APM, and databases.
- Strong knowledge of observability tools: New Relic, Datadog, Prometheus, Grafana, Graylog, CloudWatch, OpenTelemetry, SolarWinds.
- Hands‑on with incident response, RCA, MTTR/MTTD reduction, and on‑call management.
- Deep understanding of SLIs, SLOs, SLAs, and Error Budgets.
- Strong AWS experience (EC2, ECS, EKS, networking, scaling groups).
- Hands‑on with containers & orchestration (Docker, Kubernetes).
- Proficiency in Python, Java, C#, and shell scripting for automation.
- Knowledge of networking fundamentals, distributed systems, and high‑availability architectures.
- Familiarity with ITIL/ITSM processes (incident, problem, change).
- Strong leadership, stakeholder management, and communication skills.
- Experience in large‑scale SaaS or product‑driven environments.
- Hands‑on experience with databases: MongoDB, Elasticsearch, SQL Server, Oracle.
- Experience with chaos engineering, resiliency testing, and disaster recovery planning.
- Certifications: AWS…