Principal Reliability Engineer - EDS
Job in Hartford, Hartford County, Connecticut, 06112, USA
Listed on 2026-06-24
Listing for: The Hartford
Full Time, Part Time position
Job specializations:
- IT/Tech
Cloud Computing: Infrastructure & Operations, SRE/Site Reliability, Data Engineering
100% Remote
Locations: Hartford, CT; Charlotte, NC; United States - Remote
Time type: Full time
Posted on: Posted Today
Job requisition: R2625544
Principal Reliability Engineering - IE06JE
We’re determined to make a difference and are proud to be an insurance company that goes well beyond coverages and policies. Working here means having every opportunity to achieve your goals – and to help others accomplish theirs, too. Join our team as we help shape the future.
The Enterprise Data Services (EDS) organization is seeking a **Principal Reliability Engineer (Principal RE)** to serve as the senior technical authority responsible for the reliability, resilience, availability, and performance of all data platforms, cloud infrastructure, data products, and data pipelines across the enterprise data organization. This role sets the strategic vision for Reliability Engineering within EDS and leads the definition, implementation, and continuous evolution of RE practices, tooling, automation, observability frameworks, and AIOps/AI‐driven operations.
As the Principal RE, you will influence architectural direction, lead large‐scale, cross‐organizational technical initiatives, and drive a culture of engineering excellence, automation‐first operations, and proactive reliability improvement. You will partner closely with platform engineering, data engineering, security, architecture, and product teams to embed RE principles into every stage of the data product lifecycle.
This role has a hybrid work schedule, with the expectation of working in an office (Columbus, OH; Chicago, IL; Hartford, CT; or Charlotte, NC) three days a week (Tuesday through Thursday).
**Key Responsibilities**
**Enterprise Reliability Strategy & Leadership**
* Work closely with the AVP, RE & Production Support, EDS, to define the Reliability Engineering strategy for data platforms, data cloud environments, and data products.
* Establish long‐term RE roadmaps, target operating models, and architectural patterns that scale with organizational growth.
* Serve as the highest‐level technical escalation point for systemic reliability issues, influencing executive stakeholders and engineering leaders.
**Platform & Cloud Reliability (AWS, GCP, Snowflake, EMR, Hadoop, ETL/ELT)**
* Leverage enterprise‐provided standards and building blocks to architect and evolve highly reliable, performant, and cost‐efficient cloud‐based platforms across AWS and GCP for all EDS services.
* Influence and work directly with Platform Solution Architecture on new product enablement and hyperautomation (end‐to‐end blueprint automation).
* Oversee reliability controls and fail‐safe patterns for Snowflake, EMR, Hadoop/Spark clusters, container platforms (e.g., Kubernetes), and mission‐critical data systems.
* Lead the creation and enforcement of SLO/SLI frameworks that span the entire data lifecycle.
**AI‐Enabled Operations, AIOps & Intelligent Automation**
* Develop and implement AI‐driven automation for anomaly detection, alert correlation, autonomous remediation, and predictive capacity management.
* Leverage LLMs, prompt engineering, and cloud‐native AI services (AWS Bedrock, SageMaker, Vertex AI) to build intelligent runbooks, advanced troubleshooting agents, and generative‐AI‐enabled operational tooling.
* Champion the adoption of machine learning–based observability and reliability analytics.
**End‐to‐End Observability & Operational Excellence**
* Adopt and architect enterprise‐wide data observability frameworks (including logging, metrics, tracing, distributed profiling, and event pipelines) for all data platforms and pipelines.
* Establish gold‐standard incident response patterns, post‐incident reviews, and continuous improvement processes.
* Drive elimination of toil across EDS, focusing on self‐healing systems, proactive detection, and autonomous operations.
**Data Pipeline & Data Product Reliability**
* Define RE best practices for modern data products, governed data pipelines, real‐time/streaming systems, and operational analytics platforms.
* Ensure data quality, data timeliness, and SLAs for data products through automated…