Reliability Engineer
Chicago, Cook County, Illinois, 60290, USA
Listed on 2026-02-17
IT/Tech
Cloud Computing, Systems Engineer, SRE/Site Reliability, IT Support
Overview
Staff Reliability Engineer - IE07KE
We’re determined to make a difference and are proud to be an insurance company that goes well beyond coverages and policies. Working here means having every opportunity to achieve your goals – and to help others accomplish theirs, too. Join our team as we help shape the future. The Hartford is seeking a highly skilled Senior Reliability Engineer (RE) to join our Enterprise Data Organization.
This role is pivotal in applying software engineering principles to operations, ensuring the reliability, performance, and scalability of our foundational data infrastructure, platforms and applications in this organization. You will be instrumental in driving our transition from traditional production support to a modern RE model through automation, toil reduction, and standardized service management.
This role can have a Hybrid or Remote work arrangement. Candidates who live near one of our locations will have the expectation of working in an office 3 days a week (Tuesday through Thursday). Candidates who do not live near an office should maintain their current work arrangement with the expectation of coming into the office as business needs arise.
Responsibilities
- Platform Reliability & Resiliency: Design, build, and maintain highly reliable, scalable, and resilient cloud-based data platforms on AWS and GCP, including core infrastructure and services such as Snowflake, EKS, OpenSearch, EMR, and Hadoop ecosystems.
- Automation & Toil Reduction: Champion the RE mandate by identifying manual, repetitive operational tasks (toil) and developing robust automation solutions to eliminate them. This includes automating provisioning, deployment, self-healing, and operational tasks.
- Observability & Monitoring: Implement and manage comprehensive observability solutions (monitoring, alerting, logging, tracing) for the underlying data infrastructure and applications, focusing on establishing clear Service Level Indicators (SLIs) and Service Level Objectives (SLOs).
- Incident Response & Management: Act as an escalation point for production incidents, leading incident response, performing deep root cause analysis (RCA), designing error budgets, and implementing preventative measures to ensure issues do not recur.
- Standardization & Documentation: Lead the standardization of operational processes and documentation, including the creation and automation of dynamic runbooks and playbooks for consistent and efficient incident resolution and service management.
- RE Transition: Serve as the RE Subject Matter Expert and collaborate with other Platform, Product, and Data Engineering Support teams to instill RE best practices, including participation in system design consulting, capacity planning, and deployment pipelines (CI/CD).
Qualifications
- 10+ years’ overall experience in an Infrastructure, Data or related technology organization with increasing responsibilities as a hands-on technologist.
- Must have 5+ years’ experience as an RE, Cloud, or DevOps Engineer, or in a similar role supporting large-scale enterprise infrastructure and applications.
- Strong scripting and programming skills (e.g., Python) for automation and tooling development.
- Experience with infrastructure-as-code (e.g., Terraform, CloudFormation, Ansible) and CI/CD tools.
- Experience designing and operating reliable and resilient infrastructure, fail-safe patterns, reliability controls, and observability from a Reliability Engineering (SRE/RE) infrastructure support perspective across cloud and big data platforms (e.g., AWS, GCP, Amazon EMR, Hadoop/Spark, OpenSearch, and container orchestration platforms).
- Familiarity with cloud-native integrations with databases, data integration, and business intelligence platforms (e.g., Snowflake, Informatica IDMC, Tableau, and ThoughtSpot).
- Expertise in setting up and tuning monitoring and alerting systems (e.g., Dynatrace, Splunk, Prometheus, Grafana, Datadog, OpenTelemetry).
- Expertise in defining and implementing DataOps practices.
- Expertise implementing AIOps to monitor, manage, and self-heal infrastructure and data platforms; experience implementing machine learning principles for anomaly detection, alerting and runbook…