Executive Director, AI Ops Engineering
Listed on 2026-06-03
IT/Tech
Systems Engineer, Cloud Computing
We're building a world of health around every individual - shaping a more connected, convenient and compassionate health experience. At CVS Health®, you'll be surrounded by passionate colleagues who care deeply, innovate with purpose, hold themselves accountable and prioritize safety and quality in everything they do. Join us and be part of something bigger - helping to simplify health care one person, one family and one community at a time.
Executive Director, AI Platform SRE
About the Role
CVS Health is seeking an Executive Director, AI Ops Engineering to build and lead a team of professionals responsible for the continuous operation, monitoring, and optimization of CVS's Enterprise AI environment. This is first and foremost an engineering leadership role – your core accountability is ensuring the platform is always on, always performing, and always improving.
CVS Health's AI platform is a critical enterprise asset powering clinical, operational, and consumer capabilities at scale across one of the nation's largest healthcare organizations. Keeping it reliable, observable, and continuously improving is the mission. Reporting to the Global Head of Infrastructure/AI Operations and Service Delivery, you will establish and maintain operational baselines across the full infrastructure stack, ensure all changes are continuously monitored, observed, and adjusted, and drive the highest levels of availability, reliability, and scalability across every layer of the environment.
This is a greenfield organizational build – the person in this role will define the operating model, shape the team culture, and establish the engineering standards that will govern CVS's AI infrastructure for years ahead. If you thrive on building from the ground up, this role was designed for you.
Teams You Will Lead
You will build and lead a multi-disciplinary SRE organization structured across nine functional areas spanning core platform operations and innovation. The team is organized to ensure full-spectrum coverage of the AI environment – from hardware and network through platform reliability, security, observability, and 24/7 operations – while continuously developing advanced automation and self-healing capabilities.
- Platform Reliability – SLO/SLI/error budget management, availability baseline enforcement, cluster administration, GPU quota governance, and infrastructure-as-code
- Infrastructure – Compute, storage, and hardware lifecycle management, including compliance controls and data isolation
- Network – High-performance GPU networking, fabric management, security segmentation, and continuous network baseline enforcement
- Observability – End-to-end monitoring strategy, alerting pipelines, SLI/SLO dashboards, and the feedback loops that connect operational data to improvement
- Security SRE – Security posture, access controls, audit logging, vulnerability management, and regulatory compliance (HIPAA, NIST AI RMF)
- 24/7 Operations Center – Round-the-clock incident response, on-call protocols, escalation management, and shift-level change execution, structured for sustainable coverage with no mandatory overtime
- Change & Release Management – Change lifecycle governance, ITIL process management, compliance frameworks, Model Ops boundary definition, and platform knowledge base
- FinOps – GPU cost governance, utilization optimization, tenant quota enforcement, and chargeback models in partnership with Finance
In addition to core operations, you will oversee three Innovation PODs – focused on AI-driven automation, infrastructure-as-code and self-service capabilities, and chaos engineering and resilience testing – with the goal of continuously reducing manual toil and building a self-healing, self-optimizing platform over time.
What You'll Do
Leadership
- Own the SRE vision, strategy, and long-range roadmap with availability (>99.99%), reliability, and scalability as the primary measures of success
- Lead, develop, and integrate all functional teams into a cohesive, always-on operations organization – setting clear ownership, accountability, and performance expectations for each team and each engineer
- Establish and enforce operational baselines across…