Senior SRE Job, Orlando area, Florida, USA, IT/Tech

Job Title:

Senior Site Reliability Engineer (SRE)

Overview / Summary

We are seeking a Site Reliability Engineer (SRE) with 8 to 10 years of experience to drive reliability, observability, and resilience improvements across critical systems. This is a high-impact, front-line operations role focused on real-time incident response, proactive prevention, continuous automation, and reliability engineering for Tier-1 business-critical applications.

Key Responsibilities

Drive automation initiatives to improve system performance and operational efficiency.
Improve application reliability and availability by proactively identifying and mitigating risks.
Analyze production incidents and root cause analyses (RCAs) to eliminate recurring issues and reduce outages.
Define and manage Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budgets using Nobl9.
Conduct reliability assessments across applications, infrastructure, Kubernetes, databases, networks, caching platforms, and cloud environments.
Drive observability improvements using OpenTelemetry, Grafana Cloud, AppDynamics, Splunk, and monitoring best practices.
Perform performance and scalability reviews to support current and future demand.
Lead chaos engineering exercises using Gremlin or Harness Chaos Engineering.
Review cloud architectures against AWS Well-Architected Framework standards and drive remediation of reliability gaps.
Automate operational tasks and implement self-healing solutions.
Identify and eliminate single points of failure (SPOFs) and strengthen disaster recovery and failover capabilities.
Collaborate with Development, Infrastructure, Performance Engineering, and Operations teams to improve system resilience.
Establish reliability governance, dashboards, runbooks, and continuous improvement processes.

Reliability Assessment & Engineering

Conduct application reliability assessments using established reliability frameworks.
Review historical incidents, Sev-1/Sev-2 RCAs, and recurring failure patterns.
Identify reliability debt and drive remediation initiatives.
Evaluate application readiness for SRE engagement.
Perform end-to-end reliability reviews across application, infrastructure, network, and platform layers.
Define reliability roadmaps and track improvement initiatives.

Incident Management & RCA

Analyze incident trends using CSI or equivalent incident management platforms.
Participate in Major Incident Management and Problem Management processes.
Drive RCA reviews and corrective actions.
Track reliability improvement initiatives resulting from postmortems.
Reduce Mean Time to Detect (MTTD) and Mean Time to Recover (MTTR).

Service Level Management

Define and implement SLIs.
Establish SLOs and Error Budgets using Nobl9.
Partner with Product and Engineering teams to define business-focused reliability targets.
Build SLO dashboards and reliability scorecards.
Monitor error budget consumption and enforce governance policies.
Conduct reliability reviews based on SLO compliance.

Cloud & Platform Reliability

Review cloud architectures against AWS Well-Architected Framework principles.
Conduct reliability, performance, cost optimization, security, and operational excellence assessments.
Identify High Risk Issues (HRIs) and drive remediation.
Validate high availability, disaster recovery, backup, and failover capabilities.
Ensure multi-AZ and multi-region deployment strategies are implemented where required.

Kubernetes & Infrastructure Reliability

Review Kubernetes cluster health and workload configurations.
Validate resource requests, limits, autoscaling, and resiliency patterns.
Assess readiness, liveness, and startup probes.
Review service mesh configurations, network policies, and traffic routing.
Validate database high availability, caching strategies, and scaling configurations.
Identify and eliminate single points of failure.

Observability & Monitoring

Design and improve enterprise observability strategies.
Implement OpenTelemetry-based telemetry collection.
Manage metrics, events, logs, and traces (MELT).
Integrate telemetry into Grafana Cloud, Splunk Observability, or equivalent platforms.
Utilize AI-driven observability capabilities for anomaly…