Senior Manager, Site Reliability Engineering; SRE
Listed on 2025-12-07
IT/Tech
Systems Engineer, Cloud Computing
Overview
The Senior Manager, Site Reliability Engineering (SRE) will lead the SRE organization to deliver reliable, scalable, and resilient platforms and services. The role owns the strategy, implementation, and continuous improvement of a unified observability platform that provides end‑to‑end visibility into infrastructure, applications, APM, and databases, enabling proactive issue detection, faster incident resolution, and improved customer experience.
Key Responsibilities
- Hire, lead, and mentor a high‑performing SRE team across geographies.
- Define and execute the SRE vision, roadmap, and strategy in alignment with business and engineering objectives.
- Establish a healthy 24x7 on‑call model, ensuring coverage while promoting team well‑being.
- Drive a blameless culture through structured post‑mortems and RCA follow‑up actions.
- Build and manage a unified observability platform leveraging New Relic, Datadog, CloudWatch, Prometheus, Grafana, Graylog, and OpenTelemetry.
- Deliver holistic monitoring across infrastructure, applications, databases, APIs, and end‑user experience.
- Implement APM to trace performance across distributed systems.
- Establish dashboards, metrics, and proactive alerting to identify anomalies early.
- Drive adoption of AIOps and predictive analytics for proactive reliability improvements.
- Define and manage SLIs, SLOs, SLAs, and Error Budgets across services.
- Partner with engineering teams to balance velocity with reliability, ensuring adherence to Error Budgets.
- Reduce MTTD and MTTR through automation, faster detection, and better instrumentation.
- Perform capacity planning, scalability reviews, and resiliency testing.
- Lead major incident response, coordinating communications with executives and stakeholders.
- Drive root cause analysis and implement long‑term fixes.
- Partner with ITSM teams to align with incident, problem, and change management processes.
- Ensure continuous improvement loops from incidents back into observability, automation, and engineering practices.
- Collaborate with Engineering, Product, Security, Cloud, and DevOps teams to embed SRE practices.
- Provide guidance on instrumentation, reliability design, and operational readiness for new services.
- Partner with DBAs and data platform teams to monitor database health, replication, query performance, and failover readiness.
- Champion reliability as a shared responsibility across development and operations.
Qualifications
- 5+ years in SRE, Operations, or Infrastructure Engineering with 2+ years in leadership roles.
- Proven expertise in unified observability, monitoring, and alerting across infra, apps, APM, and databases.
- Strong knowledge of observability tools: New Relic, Datadog, Prometheus, Grafana, Graylog, CloudWatch, OpenTelemetry, SolarWinds.
- Hands‑on with incident response, RCA, MTTR/MTTD reduction, and on‑call management.
- Deep understanding of SLIs, SLOs, SLAs, and Error Budgets.
- Strong AWS experience (EC2, ECS, EKS, networking, scaling groups).
- Hands‑on with containers & orchestration (Docker, Kubernetes).
- Proficiency in Python, Java, C#, and shell scripting for automation.
- Knowledge of networking fundamentals, distributed systems, and high‑availability architectures.
- Familiarity with ITIL/ITSM processes (incident, problem, change).
- Strong leadership, stakeholder management, and communication skills.
- Experience in large‑scale SaaS or product‑driven environments.
- Hands‑on experience with databases: MongoDB, Elasticsearch, SQL Server, Oracle.
- Experience with chaos engineering, resiliency testing, and disaster recovery planning.
- Certifications: AWS Solutions Architect / DevOps Engineer, CKAD/CKA.
- Experience managing global SRE teams across time zones.
- Proven ability to embed reliability into engineering culture via SLOs and Error Budgets.
Estimated Salary Range: $143,000 – $191,000 plus bonus. Benefits include health, vision, and dental insurance, accident and life insurance, 401(k) matching, paid time off, and education reimbursement.
EEO & Company Information
Global Healthcare Exchange, LLC provides equal employment opportunities to all employees and applicants for employment without regard to race, color, national origin, sex, sexual orientation, gender identity, religion, age, genetic information, disability, veteran status or any other status protected by applicable law. All qualified applicants will receive consideration for employment without regard to any status protected by applicable law. GHX expects a discrimination and harassment‑free atmosphere and requires employees to cooperate in maintaining it.