Executive Director, Digital SRE & Operations

Job in Austin, Travis County, Texas, 78716, USA
Listing for: CVS Health
Full Time position
Listed on 2026-02-07
Job specializations:
  • IT/Tech
    Cloud Computing, SRE/Site Reliability

Overview

We’re building a world of health around every individual — shaping a more connected, convenient and compassionate health experience. At CVS Health®, you’ll be surrounded by passionate colleagues who care deeply, innovate with purpose, hold ourselves accountable and prioritize safety and quality in everything we do. Join us and be part of something bigger – helping to simplify health care one person, one family and one community at a time.

The Executive Director, Site Reliability Engineering (SRE) will lead the strategy, execution, and evolution of enterprise-scale reliability, availability, and operational excellence across the Digital Technology organization.

This role is accountable for end-to-end reliability of web, mobile, API, platform, and AI-enabled systems that serve millions of users. The Executive Director will establish modern SRE practices, AI-driven operations (AIOps), DevOps automation, and reliability-by-design principles, ensuring platforms are resilient, scalable, secure, and cost-efficient.

You will partner closely with Digital Platform Engineering, Digital Experience, AI Platform, Client Integrations, Security, and Architecture to embed reliability into every layer of the digital ecosystem.

Responsibilities
  • 1. SRE Strategy & Reliability Leadership

    • Define and own the enterprise SRE strategy, including SLOs, SLIs, error budgets, and reliability roadmaps.

    • Establish reliability standards and practices across web, mobile, backend services, APIs, data platforms, and AI workloads.

    • Drive a culture of reliability-by-design and operational excellence across engineering teams.

  • 2. AI-Driven Operations (AIOps) & Automation

    • Lead adoption of AIOps capabilities for proactive issue detection, alert noise reduction, and predictive failure prevention.

    • Implement AI-assisted incident triage, automated runbooks, root-cause analysis, and self-healing systems.

    • Partner with the AI Platform team to integrate LLMs and ML models into operational workflows (log summarization, anomaly detection, remediation).

  • 3. Observability & Monitoring

    • Own enterprise observability strategy across metrics, logs, traces, and user experience monitoring.

    • Standardize tooling and practices using platforms such as Datadog, Splunk, Prometheus, Grafana, and OpenTelemetry.

    • Deliver real-time dashboards and executive reporting on uptime, performance, latency, and error budgets.

  • 4. DevOps, CI/CD & Release Reliability

    • Partner with DevOps and Platform teams to ensure safe, automated, and scalable CI/CD pipelines.

    • Enable progressive delivery patterns (blue/green, canary, feature flags) to minimize blast radius.

    • Ensure quality gates, rollback mechanisms, and deployment automation are embedded into delivery pipelines.

  • 5. Incident Management & Operational Excellence

    • Lead enterprise incident response, escalation, and post-incident learning (blameless postmortems).

    • Reduce MTTR, MTTD, and incident frequency through automation and preventive engineering.

    • Establish runbooks, on-call models, and operational readiness reviews.

  • 6. Cloud Reliability & FinOps

    • Ensure reliability and scalability across cloud environments (Azure, GCP, AWS).

    • Partner with Finance and Platform Engineering to drive FinOps, cost transparency, and capacity planning.

    • Optimize performance, availability, and cost across high-traffic digital workloads.

  • 7. Leadership & Talent Development

    • Build, mentor, and lead global SRE teams, managers, and technical leaders.

    • Define SRE career paths, skill frameworks, and training programs.

    • Foster a culture of learning, accountability, and continuous improvement.

Required Qualifications
  • 18+ years of experience in software engineering, platform operations, or site reliability engineering.

  • 8+ years leading large-scale SRE, DevOps, or platform reliability organizations.

  • Experience leveraging AI/ML for operations, including anomaly detection, predictive alerts, log analysis, or automated remediation.

  • Familiarity with AIOps tools such as Datadog Watchdog, Dynatrace Davis, Splunk AI, Elastic AIOps, or custom ML/LLM solutions.

  • Understanding of how to safely operate and monitor AI-enabled production systems.

  • Deep expertise in…
