Principal Site Reliability Engineer Job, Durham area, North Carolina, USA, IT/Tech

## Principal Site Reliability Engineer

* Locations: 100 New Millennium Way, Bldg 2, Durham NC
* Time type: Full time
* Posted on: Today
* End date: July 21, 2026 (30+ days left to apply)
* Job requisition: 2130529

Job Description:

**Position Description:**

Combines operational excellence with development experience to deliver resilient services at high scale and high availability. Builds reliability into the ecosystem by applying best practices in Resiliency Engineering, Automation, Observability, and Chaos Testing. Streamlines and accelerates the software delivery cycle by using DevOps practices and toolchains. Integrates Site Reliability Engineering (SRE) practices (Observability and Chaos) with DevOps processes and delivery pipelines to stop bad code from reaching production.

Ensures business-critical enterprise systems are continuously available to internal and external customers. Implements technical standardization and process refinements within the engineering organization and for Site Reliability Engineers. Collaborates with production support teams to define and implement processes for the identification, collection, and analysis of incident data. Brings together technical, procedural, and financial data to reduce toil and increase efficiency.

**Primary Responsibilities:**

* Develops Chaos Testing capabilities using multiple Chaos Tools (AWS Fault Injection Service (FIS), Chaos Mesh, and Chaosd) and Chaos Toolkit.
* Develops and enhances the organization’s internal Chaos Framework to streamline Chaos Executions and reporting.
* Provides specialized technical expertise in the adoption of Chaos Engineering by application teams.
* Chaos tests and observes business-critical applications to understand their weaknesses and increase application resiliency.
* Activates Observability for critical applications with recommended Service Level Indicators and Service Level Objectives for Latency, Availability, Error Rate, etc.
* Utilizes modern monitoring tools (Datadog, Splunk, Catchpoint, etc.) to reduce mean time to detect an issue and improve response times.
* Creates CI/CD pipelines with security and quality checks with Application Lifecycle management toolchain. Helps in integrating Chaos and Observability with CI/CD pipelines.
* Automates repetitive activities using scripting languages (Python, Groovy, etc.).
* Implements and supports solutions based on cloud platforms (AWS/Azure) and container orchestration (Kubernetes).
* Onboards and evaluates new cloud services that help enhance the resiliency of the cloud ecosystem. Serves as a liaison for vendor engagement.
* Participates in incident management, problem management and incident postmortems.
* Takes part in peer code reviews providing qualitative feedback.
* Builds processes and capabilities to adapt and respond to risks and disruptions while maintaining business operations and data recovery with minimal disruption.
* Coaches peer SREs and application teams on SRE and DevOps.
* Applies Agile methodologies to the team’s project work, delivering in incremental and iterative steps.

**Education and Experience:**

Bachelor’s degree in Computer Science, Engineering, Information Technology, Information Systems, or a closely related field (or foreign education equivalent) and five (5) years of experience as a Principal Site Reliability Engineer (or closely related occupation) implementing resilient container and cloud-based applications and infrastructure solutions, using DevOps or SRE practices, in a financial services environment.

Or, alternatively, Master’s degree (or foreign education equivalent) in Computer Science, Engineering, Information Technology, Information Systems, or a closely related field (or foreign education equivalent) and three (3) years of experience as a Principal Site Reliability Engineer (or closely related occupation) implementing resilient container and cloud-based applications and infrastructure solutions, using DevOps or SRE practices, in a financial services environment.

**Skills and Knowledge:**

Candidate must also possess:
* Demonstrated Expertise (“DE”) improving application resiliency by implementing chaos…