Executive Director, Digital SRE & Operations
Listed on 2026-02-07
IT/Tech
Cloud Computing, SRE/Site Reliability
Overview
We’re building a world of health around every individual — shaping a more connected, convenient and compassionate health experience. At CVS Health®, you’ll be surrounded by passionate colleagues who care deeply, innovate with purpose, hold ourselves accountable and prioritize safety and quality in everything we do. Join us and be part of something bigger – helping to simplify health care one person, one family and one community at a time.
The Executive Director, Site Reliability Engineering (SRE) will lead the strategy, execution, and evolution of enterprise-scale reliability, availability, and operational excellence across the Digital Technology organization.
This role is accountable for end-to-end reliability of web, mobile, API, platform, and AI-enabled systems that serve millions of users. The Executive Director will establish modern SRE practices, AI-driven operations (AIOps), DevOps automation, and reliability-by-design principles, ensuring platforms are resilient, scalable, secure, and cost-efficient.
You will partner closely with Digital Platform Engineering, Digital Experience, AI Platform, Client Integrations, Security, and Architecture to embed reliability into every layer of the digital ecosystem.
Responsibilities
1. SRE Strategy & Reliability Leadership
• Define and own the enterprise SRE strategy, including SLOs, SLIs, error budgets, and reliability roadmaps.
• Establish reliability standards and practices across web, mobile, backend services, APIs, data platforms, and AI workloads.
• Drive a culture of reliability-by-design and operational excellence across engineering teams.
2. AI-Driven Operations (AIOps) & Automation
• Lead adoption of AIOps capabilities for proactive issue detection, alert noise reduction, and predictive failure prevention.
• Implement AI-assisted incident triage, automated runbooks, root-cause analysis, and self-healing systems.
• Partner with the AI Platform team to integrate LLMs and ML models into operational workflows (log summarization, anomaly detection, remediation).
3. Observability & Monitoring
• Own enterprise observability strategy across metrics, logs, traces, and user experience monitoring.
• Standardize tooling and practices using platforms such as Datadog, Splunk, Prometheus, Grafana, and OpenTelemetry.
• Deliver real-time dashboards and executive reporting on uptime, performance, latency, and error budgets.
4. DevOps, CI/CD & Release Reliability
• Partner with DevOps and Platform teams to ensure safe, automated, and scalable CI/CD pipelines.
• Enable progressive delivery patterns (blue/green, canary, feature flags) to minimize blast radius.
• Ensure quality gates, rollback mechanisms, and deployment automation are embedded into delivery pipelines.
5. Incident Management & Operational Excellence
• Lead enterprise incident response, escalation, and post-incident learning (blameless postmortems).
• Reduce MTTR, MTTD, and incident frequency through automation and preventive engineering.
• Establish runbooks, on-call models, and operational readiness reviews.
6. Cloud Reliability & FinOps
• Ensure reliability and scalability across cloud environments (Azure, GCP, AWS).
• Partner with Finance and Platform Engineering to drive FinOps, cost transparency, and capacity planning.
• Optimize performance, availability, and cost across high-traffic digital workloads.
7. Leadership & Talent Development
• Build, mentor, and lead global SRE teams, managers, and technical leaders.
• Define SRE career paths, skill frameworks, and training programs.
• Foster a culture of learning, accountability, and continuous improvement.
Qualifications
18+ years of experience in software engineering, platform operations, or site reliability engineering.
8+ years leading large-scale SRE, DevOps, or platform reliability organizations.
Experience leveraging AI/ML for operations, including anomaly detection, predictive alerts, log analysis, or automated remediation.
Familiarity with AIOps tools such as Datadog Watchdog, Dynatrace Davis, Splunk AI, Elastic AIOps, or custom ML/LLM solutions.
Understanding of how to safely operate and monitor AI-enabled production systems.
Deep expertise in…