Observability Platform Engineer; SRE Job, Scottsdale area, Arizona, USA, IT/Tech

Position: Staff Observability Platform Engineer (SRE)
## Staff Observability Platform Engineer (SRE)
Remote type: Hybrid
Locations: AZ - Scottsdale
Time type: Full time
Posted on: Posted Today
Time left to apply: End Date: June 1, 2026 (30+ days left to apply)
Job requisition: R0897987

We’re building a world of health around every individual — shaping a more connected, convenient and compassionate health experience. At CVS Health, you’ll be surrounded by passionate colleagues who care deeply, innovate with purpose, hold ourselves accountable and prioritize safety and quality in everything we do. Join us and be part of something bigger – helping to simplify health care one person, one family and one community at a time.
**POSITION SUMMARY**
CVS Health PBM is looking for hands-on, passionate people who want to join a high-energy, growing team at the forefront of digital innovation, one that aims to reinvent what a pharmacy and a health care company can be in the digital world. As a **Lead Platform Reliability Engineer**, you will design and implement metrics and observability frameworks with a strong focus on service level objectives (SLOs), service level indicators (SLIs), error budgets, and cloud infrastructure scaling and capacity estimation.

This individual contributor role is critical to enhancing our monitoring and observability capabilities, while also driving automation initiatives related to quality gates within the release engineering process. You will work closely with cross‐functional teams to ensure the reliability, performance, and scalable growth of our cloud‐based systems.
**Expectations for the Role:**
* **Metrics Development:** Define, implement, and maintain key performance metrics, SLOs, and SLIs to measure system reliability and performance. Ensure alignment with business objectives and operational goals.
* **Error Budgets:** Manage error budgets effectively, collaborating with development teams to balance reliability and feature delivery. Analyze incidents and outages to inform adjustments to error budgets.
* **Monitoring & Observability:** Design and implement comprehensive monitoring solutions to provide real-time visibility into system health. Use tools such as Prometheus, Grafana, Loki, Tempo, and other observability platforms to create dashboards and alerts.
* **Cloud Infrastructure Scaling:** Architect, design, and implement scalable cloud infrastructure capable of supporting multiple business applications, ensuring reliability, performance, and future growth.
* **Quality Gates Automation:** Develop and implement automated quality gates that ensure all releases meet defined reliability and performance standards. Lead the release DevOps team to integrate these gates into the CI/CD pipeline.
* **Incident Management:** Assist in incident response efforts by providing insights from metrics and monitoring tools. Conduct post-mortem analyses to identify root causes and recommend preventive measures.
**REQUIRED QUALIFICATIONS**
* 10+ years of experience in Software Engineering, Platform Engineering, or SRE.
* 7+ years of experience with observability practices, including SLIs/SLOs/SLAs, alerting, and incident management.
* 7+ years building production-grade backend services in Java/Python.
* 7+ years implementing and operating OpenTelemetry, including OTLP, semantic conventions, and instrumentation patterns.
* 7+ years with cloud-native and containerized platforms (Docker, Kubernetes, Argo CD).
* 7+ years working with public cloud platforms (AWS, GCP, or Azure).
* 5+ years designing and scaling distributed, high‐volume data pipelines.
* 5+ years working with Grafana OSS or comparable observability backends (e.g., Grafana, Loki, Tempo, Prometheus).
* 5+ years with relational databases (PostgreSQL, MySQL).
**PREFERRED QUALIFICATIONS**
* Excellent analytical skills and the ability to communicate complex technical concepts to non-technical stakeholders
* Experience with service meshes and networking technologies such as Envoy and Istio
* Experience integrating or operating commercial observability platforms (Splunk, AppDynamics, etc.)
* Experience with streaming and data platforms such as Kafka, Pulsar, or similar technologies
*…