Executive Director, Digital SRE & Operations Job Austin area, Texas USA, IT/Tech

Overview

We’re building a world of health around every individual — shaping a more connected, convenient and compassionate health experience. At CVS Health®, you’ll be surrounded by passionate colleagues who care deeply, innovate with purpose, hold ourselves accountable and prioritize safety and quality in everything we do. Join us and be part of something bigger – helping to simplify health care one person, one family and one community at a time.

The Executive Director, Site Reliability Engineering (SRE) will lead the strategy, execution, and evolution of enterprise-scale reliability, availability, and operational excellence across the Digital Technology organization.

This role is accountable for end-to-end reliability of web, mobile, API, platform, and AI-enabled systems that serve millions of users. The Executive Director will establish modern SRE practices, AI-driven operations (AIOps), DevOps automation, and reliability-by-design principles, ensuring platforms are resilient, scalable, secure, and cost-efficient.

You will partner closely with Digital Platform Engineering, Digital Experience, AI Platform, Client Integrations, Security, and Architecture to embed reliability into every layer of the digital ecosystem.

Responsibilities

1. SRE Strategy & Reliability Leadership

• Define and own the enterprise SRE strategy, including SLOs, SLIs, error budgets, and reliability roadmaps.

• Establish reliability standards and practices across web, mobile, backend services, APIs, data platforms, and AI workloads.

• Drive a culture of reliability-by-design and operational excellence across engineering teams.

2. AI-Driven Operations (AIOps) & Automation

• Lead adoption of AIOps capabilities for proactive issue detection, alert noise reduction, and predictive failure prevention.

• Implement AI-assisted incident triage, automated runbooks, root-cause analysis, and self-healing systems.

• Partner with the AI Platform team to integrate LLMs and ML models into operational workflows (log summarization, anomaly detection, remediation).

3. Observability & Monitoring

• Own enterprise observability strategy across metrics, logs, traces, and user experience monitoring.

• Standardize tooling and practices using platforms such as Datadog, Splunk, Prometheus, Grafana, and OpenTelemetry.

• Deliver real-time dashboards and executive reporting on uptime, performance, latency, and error budgets.

4. DevOps, CI/CD & Release Reliability

• Partner with DevOps and Platform teams to ensure safe, automated, and scalable CI/CD pipelines.

• Enable progressive delivery patterns (blue/green, canary, feature flags) to minimize blast radius.

• Ensure quality gates, rollback mechanisms, and deployment automation are embedded into delivery pipelines.

5. Incident Management & Operational Excellence

• Lead enterprise incident response, escalation, and post-incident learning (blameless postmortems).

• Reduce MTTR, MTTD, and incident frequency through automation and preventive engineering.

• Establish runbooks, on-call models, and operational readiness reviews.

6. Cloud Reliability & FinOps

• Ensure reliability and scalability across cloud environments (Azure, GCP, AWS).

• Partner with Finance and Platform Engineering to drive FinOps, cost transparency, and capacity planning.

• Optimize performance, availability, and cost across high-traffic digital workloads.

7. Leadership & Talent Development

• Build, mentor, and lead global SRE teams, managers, and technical leaders.

• Define SRE career paths, skill frameworks, and training programs.

• Foster a culture of learning, accountability, and continuous improvement.

Required Qualifications

18+ years of experience in software engineering, platform operations, or site reliability engineering.

8+ years leading large-scale SRE, DevOps, or platform reliability organizations.

Experience leveraging AI/ML for operations, including anomaly detection, predictive alerts, log analysis, or automated remediation.

Familiarity with AIOps tools such as Datadog Watchdog, Dynatrace Davis, Splunk AI, Elastic AIOps, or custom ML/LLM solutions.

Understanding of how to safely operate and monitor AI-enabled production systems.

Deep expertise in…




