Site Reliability AI Engineer Job, Scottsdale area, Arizona, USA, IT/Tech

Position: Staff Site Reliability AI Engineer

At CVS Health, we’re building a world of health around every consumer and surrounding ourselves with dedicated colleagues who are passionate about transforming health care.

As the nation’s leading health solutions company, we reach millions of Americans through our local presence, digital channels and more than 300,000 purpose-driven colleagues – caring for people where, when and how they choose in a way that is uniquely more connected, more convenient and more compassionate. And we do it all with heart, each and every day.

Position Summary

The PCW (Pharmacy & Consumer Wellness) SRE team is seeking a Staff Site Reliability Engineer (SRE) to lead the reliability, scalability, and security of our Conversational and Generative AI Platform. This platform enables deployment and orchestration of open‑source Large Language Models (LLMs), supports advanced AI use cases such as model fine‑tuning, retrieval‑augmented generation (RAG), and multi‑agent systems, and powers next‑generation conversational and predictive experiences.

The ideal candidate will combine deep expertise in cloud‑native infrastructure, security compliance, and AI‑driven systems with a proven ability to drive automation, observability, and resilience across mission‑critical platforms.

Key Responsibilities

Architect and maintain highly available, secure, and scalable infrastructure for AI‑driven applications and services.
Automate end‑to‑end workflows, from infrastructure provisioning to application deployment and incident response, eliminating operational toil.
Champion security‑first principles, embedding compliance with HIPAA, PCI DSS, and ADA standards into all processes; partner with enterprise security teams to ensure governance.
Implement observability best practices, leveraging tools like Prometheus, Grafana, and Istio to monitor system health and performance.
Collaborate cross‑functionally to troubleshoot complex platform, service, and data issues; perform root cause analysis and implement preventive measures.
Mentor and guide engineers in SRE and DevOps best practices, fostering a culture of reliability and continuous improvement.
Drive innovation in AI infrastructure, optimizing for GPU/TPU resource management, distributed training, and orchestration of AI workloads in Kubernetes environments.
Support audits and compliance reviews, ensuring timely implementation of recommendations and adherence to security standards.

Required Qualifications

10+ years of experience in IT and digital solution development, with a proven track record of delivering enterprise‑scale systems.
Demonstrated leadership in Site Reliability Engineering (SRE), including managing 24/7 operations and driving system resilience.
CISSP certification with deep expertise in cloud security, network security, application security, and compliance standards (HIPAA, PCI DSS, ADA).
Strong knowledge of cloud security architectures, networking fundamentals (DNS, WAF, DHCP, Firewalls, IP routing), and secure application design.
Expertise in modern AI paradigms, including Generative AI, Large Language Models (LLMs), Retrieval‑Augmented Generation (RAG), and multi‑agent systems.
Hands‑on experience with AI frameworks and platforms (e.g., TensorFlow, PyTorch, Hugging Face) and orchestration of AI pipelines for production environments.
5+ years of experience with MLOps practices, including model lifecycle management, monitoring, and continuous improvement in cloud‑native environments.
Ability to evaluate emerging AI technologies and integrate them into scalable architectures for predictive analytics, NLP, and computer vision.
Leadership in ethical AI and governance, ensuring compliance with data privacy, bias mitigation, and responsible AI principles.
5+ years of experience in Kubernetes and Docker containerization, with hands‑on experience in Rancher and Google Kubernetes Engine (GKE).
5+ years of experience with cloud technologies (GCP preferred), microservices, and web APIs.
5+ years of experience implementing DevOps, GitOps, Grafana, Istio, and Prometheus.

Preferred Qualifications

Cloud Architect certification in AWS, Azure, or Google Cloud.
Advanced experience with relational and NoSQL…

