Observability Platform Engineer; SRE
Listed on 2026-05-16
IT/Tech
Systems Engineer, Cloud Computing
POSITION SUMMARY
We're building a world of health around every individual – shaping a more connected, convenient, and compassionate health experience. At CVS Health®, you'll be surrounded by passionate colleagues who care deeply, innovate with purpose, hold themselves accountable, and prioritize safety and quality in everything they do. Join us and be part of something bigger – helping to simplify health care one person, one family, and one community at a time.
Responsibilities
- Metrics Development: Define, implement, and maintain key performance metrics, SLOs, and SLIs to measure system reliability and performance. Ensure alignment with business objectives and operational goals.
- Error Budgets: Manage error budgets effectively, collaborating with development teams to balance reliability and feature delivery. Analyze incidents and outages to inform adjustments to error budgets.
- Monitoring & Observability: Design and implement comprehensive monitoring solutions to provide real‑time visibility into system health. Utilize tools such as Prometheus, Grafana, Loki, Tempo, and other observability platforms to create dashboards and alerts.
- Cloud Infrastructure Scaling: Architect, design, and implement scalable cloud infrastructure capable of supporting multiple business applications, ensuring reliability, performance, and future growth.
- Quality Gates Automation: Develop and implement automated quality gates that ensure all releases meet defined reliability and performance standards. Lead the release DevOps team to integrate these gates into the CI/CD pipeline.
- Incident Management: Assist in incident response efforts by providing insights from metrics and monitoring tools. Conduct post‑mortem analyses to identify root causes and recommend preventive measures.
Qualifications
- 10+ years of experience in Software Engineering, Platform Engineering, or SRE.
- 7+ years of experience with observability practices, including SLIs/SLOs/SLAs, alerting, and incident management.
- 7+ years building production‑grade backend services in Java or Python.
- 7+ years implementing and operating OpenTelemetry, including OTLP, semantic conventions, and instrumentation patterns.
- 7+ years with cloud‑native and containerized platforms (Docker, Kubernetes, Argo CD).
- 7+ years working with public cloud platforms (AWS, GCP, or Azure).
- 5+ years designing and scaling distributed, high‑volume data pipelines.
- 5+ years working with Grafana OSS or comparable observability backends (e.g., Grafana, Loki, Tempo, Prometheus).
- 5+ years with relational databases (PostgreSQL, MySQL).
- Excellent analytical skills and the ability to communicate complex technical concepts to non‑technical stakeholders.
- Experience with service meshes and networking technologies such as Envoy and Istio.
- Experience integrating or operating commercial observability platforms (Splunk, AppDynamics, etc.).
- Experience with streaming and data platforms such as Kafka, Pulsar, or similar technologies.
- Familiarity with time‑series, NoSQL, or analytical databases (ClickHouse, Bigtable, Cassandra, etc.).
- Experience with Infrastructure as Code tools such as Terraform or CloudFormation.
- Experience with cost optimization and capacity planning for large‑scale cloud infrastructure.
- Experience with chaos engineering, resiliency testing, or fault injection.
- Background in security‑aware platform design, including secure service‑to‑service communication.
- Experience mentoring senior engineers and influencing platform standards across organizations.
- Strong operational experience supporting 24x7 production systems, including on‑call responsibilities.
- Knowledge of security best practices in cloud environments.
Bachelor's degree or equivalent experience (HS diploma + 4 years relevant experience)
Pay Range
The typical pay range for this role is: $ – $
Benefits
We take pride in offering a comprehensive and competitive mix of pay and benefits that reflects our commitment to our colleagues and their families. This full‑time position is eligible for a comprehensive benefits package designed to support the physical, emotional, and financial well‑being of colleagues and their families. The benefits for this position include medical, dental, and vision coverage, paid time off, retirement savings options, wellness programs, and other resources, based on eligibility.
EEO Statement
Qualified applicants with arrest or conviction records will be considered for employment in accordance with all federal, state and local laws.