Lead Site Reliability Engineer
Listed on 2026-03-02
IT/Tech
Systems Engineer, Cloud Computing
Eden Prairie, MN; Arizona; or Telecommute
Hybrid role
Compensation: $60
THE ROLE
Our client is seeking a Lead Site Reliability Engineer to serve as the technical anchor for a team dedicated to ensuring the stability, performance, and resiliency of its most critical applications. In this pivotal role, you will provide technical leadership on system architecture, workload placement, and transaction optimization. You will architect and implement solutions to enhance system reliability, drive incident management maturity, and lead efforts to eliminate single points of failure across both on-premises and cloud environments.
The successful candidate will mentor a team of engineers, optimize applications for performance and cost, and establish robust monitoring and observability practices. You will also play a key role in enhancing failover capabilities and ensuring services are designed for zonal and regional resiliency.
RESPONSIBILITIES
- Provide technical leadership and mentorship to engineers on Site Reliability Engineering (SRE) best practices.
- Architect and implement solutions to improve system reliability and eliminate single points of failure for critical applications, including technologies such as Azure Front Door and Cloudflare.
- Drive incident management maturity by reducing Mean Time to Recover (MTTR) to 60 minutes or less and conducting deep root cause analysis on P1/P2 incidents.
- Lead the development and build-out of proactive monitoring solutions, including Business Journey Maps, using observability tools such as Dynatrace.
- Establish and lead ongoing architectural review processes to ensure zonal and regional resiliency for cloud-native applications.
- Partner with development teams to optimize applications for performance, reliability, and cost in the cloud.
- Make direct technical adjustments to improve system stability and guide the team in eliminating single points of failure.
QUALIFICATIONS
- 8+ years of experience in technical roles such as Software Engineering, Systems Engineering, or DevOps.
- 3+ years of dedicated experience as a Site Reliability Engineer.
- Deep expertise with at least one major cloud platform (Azure, AWS, GCP).
- Proven experience with observability and monitoring tools (e.g., Dynatrace, Prometheus, Grafana).
- Strong understanding of networking, distributed systems, and infrastructure-as-code (Terraform, Ansible).
- Experience as a technical lead or mentor.
- Proficiency in one or more programming languages (e.g., Python, Go, Java).
- Experience in a full-stack engineering capacity.
- Knowledge of containerization and orchestration (Docker, Kubernetes).