Software Development Engineer
Listed on 2026-05-19
IT/Tech
Systems Engineer, SRE/Site Reliability, Cloud Computing, IT Support
Location: Tallahassee
Position Summary
Join Fortune 7 CVS Health as a Staff Software Engineer to lead and advance our Site Reliability Engineering (SRE), AIOps, Observability, and Monitoring capabilities in the CVS Digital team. This role is critical to advancing intelligent, automated, and scalable reliability practices across our platforms. You will drive the evolution from traditional monitoring to AI-driven operations (AIOps), leveraging automation, machine learning, and advanced analytics to improve system resilience, reduce operational toil, and accelerate incident detection and resolution.
As a technical leader, you will influence architecture, build platforms, and mentor teams to embed reliability, observability, and automation into the software delivery lifecycle.
SRE Strategy & Reliability Engineering
- Define and implement enterprise-wide SRE practices, including SLIs, SLOs, error budgets, and reliability governance.
- Drive a culture of reliability, automation, and continuous improvement across engineering teams.
- Establish metrics-driven approaches to measure system health, availability, and performance.
AIOps & Intelligent Operations
- Lead adoption of AIOps solutions to enable predictive monitoring, anomaly detection, and automated root cause analysis.
- Integrate machine learning models and analytics into monitoring pipelines to proactively detect and prevent incidents.
- Develop intelligent alerting systems to reduce noise and improve signal quality.
Observability & Monitoring Platforms
- Architect and build scalable observability frameworks covering metrics, logs, traces, and events.
- Define standards for instrumentation, telemetry collection, and distributed tracing.
- Enable real-time insights into system performance across microservices and cloud-native architectures.
Incident Management & Automation
- Lead incident response practices, including on-call readiness, root cause analysis (RCA), postmortems, and continuous learning loops.
- Build self-healing systems and automate remediation workflows to reduce Mean Time to Resolution (MTTR).
- Implement runbooks, playbooks, and automated escalations.
Platform Engineering & Tooling
- Develop internal platforms and tools for observability, monitoring, and performance optimization.
- Integrate observability into CI/CD pipelines to enable proactive quality and reliability checks.
- Drive infrastructure automation using Infrastructure as Code (IaC) frameworks and GitOps principles.
Collaboration & Technical Leadership
- Partner with engineering, platform, and product teams to embed reliability and observability into system design.
- Mentor engineers and lead design reviews focused on scalability, resilience, and operability.
- Influence enterprise architecture decisions and promote best practices across teams.
- 5+ years of experience in software engineering, SRE, or production engineering in large-scale distributed systems.
- Hands-on experience with observability tools such as AppDynamics, Grafana, Prometheus, Datadog, OpenTelemetry, or similar.
- Experience with AIOps or intelligent monitoring platforms, including anomaly detection and event correlation.
- Strong expertise in cloud platforms (AWS, Azure, or GCP) and cloud‑native architectures (Kubernetes, containers, microservices).
- Proficiency in at least one programming language (e.g., Python, Java, Go).
- Strong understanding of distributed systems, resiliency patterns, and fault tolerance.
- Experience implementing incident management, on‑call processes, and root cause analysis.
- Hands-on expertise with Infrastructure as Code (Terraform, ARM, CloudFormation) and CI/CD pipelines.
- Experience using GenAI and automation tools and frameworks such as OpenAI, Copilot, Gemini, Claude, and MCP.
- Proven ability to design scalable, reliable, and observable systems.
- Experience designing and implementing AIOps platforms or predictive reliability systems at scale.
- Strong knowledge of machine learning applications in IT operations (e.g., anomaly detection, forecasting, clustering).
- Experience defining and managing SLIs/SLOs and error budgets at scale.
- Experience with OpenTelemetry and modern observability standards.
- Familiarity with chaos engineering, resilience…