Principal Architect - Cloud and Observability
Listed on 2026-06-04
IT/Tech
Systems Engineer, Cloud Computing
We’re building a world of health around every individual — shaping a more connected, convenient and compassionate health experience. At CVS Health®, you’ll be surrounded by passionate colleagues who care deeply, innovate with purpose, hold ourselves accountable and prioritize safety and quality in everything we do. Join us and be part of something bigger – helping to simplify health care one person, one family and one community at a time.
Position Summary
We’re hiring a Principal Architect to take ownership of how we do observability and hybrid cloud at CVS Health. This person will sit within our Enterprise Architecture organization and be responsible for the architecture, standards, and technical direction behind our observability platforms and our multi‑cloud infrastructure posture. We run workloads across on‑prem private cloud (OpenShift, KVM, Dell PowerFlex), Azure, AWS, and GCP.
We need someone who can build and maintain the reference architectures, telemetry standards, and instrumentation patterns that let our engineering teams monitor all of that consistently. We've committed to an OpenTelemetry‑first approach and use the Grafana stack (Mimir, Loki, Tempo) as our primary backends, but we also operate Datadog, Splunk, and Dynatrace in various parts of the org. On the cloud side, there is real work to do around workload identity, runtime selection, autoscaling guidance, and FinOps.
Teams are asking for concrete standards they can follow. This is a hands‑on role. You'll write architecture docs, build proof‑of‑concepts, configure OTel pipelines, and present to leadership.
This position can work remotely from anywhere in the continental USA.
Responsibilities
Observability
- Own the enterprise observability reference architecture covering metrics, logs, traces, and events across all environments (cloud and on‑prem).
- Drive the OpenTelemetry‑first instrumentation strategy – standard libraries, semantic conventions, collector topologies (DaemonSet, gateway, sidecar), and pipeline design.
- Build and operate telemetry pipelines on Grafana Mimir, Loki, and Tempo, including multi‑tenant configurations, retention policies, and capacity planning.
- Define how we measure reliability: SLOs, SLIs, error budgets, and alerting frameworks – consistently across all lines of business.
- Own the integration between observability tooling and incident management (ServiceNow ITOM, xMatters).
- Build and maintain reference architectures for our hybrid footprint: OpenShift on‑prem with KVM/libvirt and Dell PowerFlex storage, plus Azure, AWS, and GCP.
- Lead standards work around workload identity and federation using SPIFFE/SPIRE and cloud‑native IAM patterns to move away from static secrets.
- Provide guidance on compute runtime selection – containers vs. VMs vs. bare metal vs. serverless – with a clear decision framework for teams.
- Help teams connect autoscaling and capacity planning behavior to actual telemetry signals.
- Push FinOps maturity forward by integrating cost data into the observability stack, establishing unit economics, and working toward open billing standards like FOCUS.
- Identify where AI/ML adds practical value in our observability stack – anomaly detection, root cause analysis, log clustering, and smarter alerting.
- Define observability standards for AI‑powered systems (agents, RAG pipelines) – covering latency, token costs, model drift, and related signals.
- Ensure new AI‑powered platforms are instrumented correctly from day one.
- Participate in cross‑functional architecture working groups focused on observability and hybrid cloud standards.
- Publish architecture decision records and reference implementations that teams can actually use.
- Mentor architects and platform engineers; conduct architecture reviews to raise the bar across the org.
- Work with security and compliance on HIPAA, SOX, and PCI requirements as they apply to telemetry and cloud infrastructure.
- Represent CVS Health in vendor evaluations and stay connected to the open source ecosystem (CNCF, OpenTelemetry, Grafana Labs).
- 10+ years in infrastructure, cloud architecture,…