Senior SRE
Listed on 2026-06-20
IT/Tech
SRE/Site Reliability, Cloud Computing: Infrastructure & Operations, Systems Engineer, Cybersecurity
Senior Site Reliability Engineer (SRE)
Overview / Summary
We are seeking a Site Reliability Engineer (SRE) with 8-10 years of experience to drive reliability, observability, and resilience improvements across critical systems. This is a high-impact, front-line operations role focused on real-time incident response, proactive prevention, continuous automation, and reliability engineering for Tier-1 business-critical applications.
Key Responsibilities
Drive automation initiatives to improve system performance and operational efficiency.
Improve application reliability and availability by proactively identifying and mitigating risks.
Analyze production incidents and root cause analyses (RCAs) to eliminate recurring issues and reduce outages.
Define and manage Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budgets using Nobl9.
Conduct reliability assessments across applications, infrastructure, Kubernetes, databases, networks, caching platforms, and cloud environments.
Drive observability improvements using OpenTelemetry, Grafana Cloud, AppDynamics, Splunk, and monitoring best practices.
Perform performance and scalability reviews to support current and future demand.
Lead chaos engineering exercises using Gremlin or Harness Chaos Engineering.
Review cloud architectures against AWS Well-Architected Framework standards and drive remediation of reliability gaps.
Automate operational tasks and implement self-healing solutions.
Identify and eliminate single points of failure (SPOFs) and strengthen disaster recovery and failover capabilities.
Collaborate with Development, Infrastructure, Performance Engineering, and Operations teams to improve system resilience.
Establish reliability governance, dashboards, runbooks, and continuous improvement processes.
Reliability Assessment & Engineering
Conduct application reliability assessments using established reliability frameworks.
Review historical incidents, Sev-1/Sev-2 RCAs, and recurring failure patterns.
Identify reliability debt and drive remediation initiatives.
Evaluate application readiness for SRE engagement.
Perform end-to-end reliability reviews across application, infrastructure, network, and platform layers.
Define reliability roadmaps and track improvement initiatives.
Incident Management & RCA
Analyze incident trends using CSI or equivalent incident management platforms.
Participate in Major Incident Management and Problem Management processes.
Drive RCA reviews and corrective actions.
Track reliability improvement initiatives resulting from postmortems.
Reduce Mean Time to Detect (MTTD) and Mean Time to Recover (MTTR).
Service Level Management
Define and implement SLIs.
Establish SLOs and Error Budgets using Nobl9.
Partner with Product and Engineering teams to define business-focused reliability targets.
Build SLO dashboards and reliability scorecards.
Monitor error budget consumption and enforce governance policies.
Conduct reliability reviews based on SLO compliance.
Cloud & Platform Reliability
Review cloud architectures against AWS Well-Architected Framework principles.
Conduct reliability, performance, cost optimization, security, and operational excellence assessments.
Identify High Risk Issues (HRIs) and drive remediation.
Validate high availability, disaster recovery, backup, and failover capabilities.
Ensure multi-AZ and multi-region deployment strategies are implemented where required.
Kubernetes & Infrastructure Reliability
Review Kubernetes cluster health and workload configurations.
Validate resource requests, limits, autoscaling, and resiliency patterns.
Assess readiness, liveness, and startup probes.
Review service mesh configurations, network policies, and traffic routing.
Validate database high availability, caching strategies, and scaling configurations.
Identify and eliminate single points of failure.
Observability & Monitoring
Design and improve enterprise observability strategies.
Implement OpenTelemetry-based telemetry collection.
Manage metrics, events, logs, and traces (MELT).
Integrate telemetry into Grafana Cloud, Splunk Observability, or equivalent platforms.
Utilize AI-driven observability capabilities for anomaly…