Executive Director, AI Infrastructure & Platform Engineering
Listed on 2026-06-24
IT/Tech
Systems Engineer
The Executive Director, AI Infrastructure & Platform Engineering is a senior engineering leadership role responsible for standing up, operating, and continuously improving CVS Health's on‑premises AI compute platform. This position owns the physical and platform layers of CVS's Enterprise AI Factory – a frontier‑class GPU compute environment running NVIDIA Blackwell systems across a high‑throughput RoCE v2 fabric, hosted in co‑located data center facilities, with multi‑site expansion underway.
Reporting to the Global Head of Infrastructure/AI Operations and Service Delivery, this leader will establish operational baselines across the full infrastructure stack – hardware, network fabric, GPU clusters, storage, and the operating systems and orchestration layers above – and build the Site Reliability Engineering practice that delivers the availability, reliability, and performance that frontier AI workloads demand. This is a greenfield organizational build.
The Executive Director will define the operating model, set the engineering standards, hire and develop the team, and establish the long‑term operations capability that will govern CVS’s AI infrastructure for years ahead.
Strategy and Leadership:
- Define and execute the long‑range vision and strategy for AI infrastructure and platform engineering, with availability (>99.99%), reliability, and platform performance as the primary measures of success.
- Recruit, hire, develop, and retain a high‑performing engineering organization spanning infrastructure, network, platform reliability, observability, security, 24/7 operations, change and release management, and FinOps.
- Establish clear ownership, accountability, and performance expectations across all functional teams; foster a culture of operational excellence, engineering rigor, and continuous improvement.
- Provide executive‑level communication to senior leadership on platform status, milestones, risk posture, and strategic initiatives.
Infrastructure and Platform Engineering:
- Own the physical layer of the AI compute environment – GPU compute, storage, network fabric, capacity planning, and hardware lifecycle accountability.
- Direct bare‑metal Kubernetes and OpenShift operations, including cluster administration, GPU quota governance, infrastructure‑as‑code adoption, and availability baseline enforcement.
- Govern high‑performance network fabric operations – RoCE v2, spine‑leaf topology, lossless Ethernet tuning, congestion management, and segmentation.
- Establish and enforce operational baselines across every layer of the stack – hardware, fabric, platform, and workload – with deviations detected, escalated, and resolved within defined SLAs.
- Direct Innovation POD strategy to develop self‑healing and autonomous capabilities that proactively prevent service degradation before it impacts availability.
Operations and Reliability:
- Build and sustain a high‑performing 24/7 operations model – designed for sustainable, predictable coverage with no mandatory overtime and measurable team health and retention.
- Drive end‑to‑end observability across the physical and platform layers, with continuous feedback loops connecting monitoring data to incident response, change decisions, and improvement cycles.
- Oversee change management so every modification is risk‑assessed, monitored during rollout, and baseline‑validated post‑deployment.
- Ensure configuration consistency and drift detection across all platform components to prevent baseline degradation over time.
- Lead GPU FinOps governance – utilization optimization, tenant quota enforcement, and cost reduction – in partnership with the Finance organization.
Security and Compliance:
- Empower the Security SRE Lead to maintain a world‑class security posture across the infrastructure and platform layers and robust compliance with frameworks including HIPAA and NIST AI RMF.
- Govern access controls, audit logging, vulnerability management, and network segmentation across the AI compute environment.
Program Transition and Operating Model:
- Lead the operational transition from program‑launch staffing to permanent…