Lead Engineer, SRE
Listed on 2025-12-26
IT/Tech
Systems Engineer, Cloud Computing, SRE/Site Reliability
Join to apply for the Lead Engineer, SRE role at The Open University
Are you passionate about reliability engineering and ready to lead the evolution of cloud-scale operations at a world-class institution?
Join The Open University as a Lead Engineer, SRE and help shape the future of resilient, automated, and observable platforms that power learning for millions.
About the Role
As a Lead Engineer, SRE at The Open University, you will be a key driver of reliability, scalability, and operational excellence across our platforms and services. This role goes beyond traditional operations: you will help shape how we design, build, deploy, and run systems in production, applying both a software engineering mindset and deep cloud expertise.
Your primary focus will be on Microsoft Azure, where you will architect and manage resilient cloud infrastructure, implement automation at scale, and integrate observability into all our deliverables. By partnering with architects, software engineers, and product teams, you will ensure our systems meet the highest standards of performance, security, and cost efficiency while remaining easy to operate.
In this senior role, you will:
- Drive the adoption of SRE practices such as SLIs, SLOs, error budgets, and blameless post‑mortems to improve system reliability.
- Embed automation and self‑healing mechanisms to reduce manual toil and accelerate recovery from failures.
- Champion infrastructure as code (IaC) using Bicep, ensuring consistent, repeatable, and compliant environments.
- Build out end‑to‑end observability, enabling proactive issue detection and actionable insights into system health.
- Partner with engineering leadership to shape the technical roadmap, guiding investments in scalability, resilience, and DevOps culture.
This role also carries a strong mentorship and leadership component. You will coach engineers across teams, advocate for best practices, and foster a build‑run‑own mindset that elevates operational maturity.
We are seeking an individual who excels in complex, hybrid environments—encompassing on‑premises, cloud‑native, and multi‑cloud (Azure, AWS) platforms—and can effectively balance tactical problem‑solving with strategic foresight. The ideal candidate will be passionate about automation, resilience engineering, and cloud‑scale operations, with the ambition to make a lasting impact on how services are delivered and operated.
Key Responsibilities
- Reliability & Performance: Ensure critical systems and applications are highly available, fault‑tolerant, and performant. Implement SLIs, SLOs, and SLAs to measure and drive service reliability. Conduct capacity planning, performance tuning, and chaos engineering exercises to validate system resilience.
- Cloud Platform Ownership: Design, build, and manage scalable infrastructure on Azure, leveraging services such as App Service, Functions, Service Bus, Front Door, Azure SQL, and Event Hubs. Use Infrastructure as Code (IaC) with Bicep and Terraform to standardise deployments. Optimise cloud cost efficiency (FinOps) while ensuring stability and performance.
- Automation & Operations: Automate operational tasks using PowerShell, Bash, or Python. Enhance CI/CD pipelines to accelerate deployments and reduce production risks. Lead efforts to minimise toil by building self‑healing and auto‑scaling systems.
- Observability & Incident Management: Implement robust monitoring, logging, and tracing solutions (e.g., Azure Monitor, Application Insights, Splunk). Lead incident response and post‑mortem reviews, identifying root causes and driving long‑term fixes. Establish operational runbooks and playbooks to facilitate the rapid resolution of incidents.
- Security & Compliance: Embed security‑by‑design practices in infrastructure and pipelines. Ensure compliance with relevant standards through proactive monitoring and automation. Collaborate with security teams to manage vulnerability assessments and remediation efforts.
- Collaboration & Leadership: Act as a trusted advisor to engineering teams on scalability, reliability, and operational excellence. Mentor engineers in SRE…