Lead Engineer, SRE Job Milton Keynes area, England UK, IT/Tech

Job Location: Milton Keynes, Remote/Hybrid

Salary: £59,966 to £67,468 + Market Related Pay of £6,746 per year until 31 December 2027 initially

Closing Date: 11 January 2026

Weekly Working Hours: 37

Contract Type: Permanent

Fixed Term Contract End Date: Not Applicable

Welsh Language: Not Applicable

Change your career, change lives

The Open University is the UK’s largest university and a world leader in flexible part-time education. It combines a mission to widen access to higher education with research excellence, transforming lives through education. Find out more about us and our mission by watching this short video (you will be taken to YouTube by clicking this link).

About the Role

As a Lead Engineer, SRE at The Open University, you will be a key driver of reliability, scalability, and operational excellence across our platforms and services. This role goes beyond traditional operations—you will help shape how we design, build, deploy, and run systems in production, applying both a software engineering mindset and deep cloud expertise.

Your primary focus will be on Microsoft Azure, where you will architect and manage resilient cloud infrastructure, implement automation at scale, and integrate observability into all our deliverables. By partnering with architects, software engineers, and product teams, you will ensure our systems meet the highest standards of performance, security, and cost efficiency while remaining easy to operate.

In this senior role, you will:

Drive the adoption of SRE practices such as SLIs, SLOs, error budgets, and blameless postmortems to improve system reliability.
Embed automation and self-healing mechanisms to reduce manual toil and accelerate recovery from failures.
Champion infrastructure as code (IaC) using Bicep, ensuring consistent, repeatable, and compliant environments.
Build out end-to-end observability, enabling proactive issue detection and actionable insights into system health.
Partner with engineering leadership to shape the technical roadmap, guiding investments in scalability, resilience, and DevOps culture.

This role also carries a strong mentorship and leadership component. You will coach engineers across teams, advocate for best practices, and foster a build-run-own mindset that elevates operational maturity. As part of the engineering profession, you will influence architectural decisions, guide platform evolution, and ensure that our technical direction aligns with both short-term delivery goals and long-term strategic vision.

We are seeking an individual who excels in complex, hybrid environments—encompassing on-premises, cloud-native, and multi-cloud (Azure, AWS) platforms—and can effectively balance tactical problem-solving with strategic foresight. The ideal candidate will be passionate about automation, resilience engineering, and cloud-scale operations, with the ambition to make a lasting impact on how services are delivered and operated.

Key Responsibilities

Reliability & Performance: Ensure critical systems and applications are highly available, fault-tolerant, and performant. Implement SLIs, SLOs, and SLAs to measure and drive service reliability. Conduct capacity planning, performance tuning, and chaos engineering exercises to validate system resilience.
Cloud Platform Ownership: Design, build, and manage scalable infrastructure on Azure, leveraging services such as App Services, Functions, Service Bus, Front Door, Azure SQL and Event Hub. Use Infrastructure as Code (IaC) with Bicep and Terraform to standardise deployments. Optimise cloud cost efficiency (FinOps) while ensuring stability and performance.
Automation & Operations: Automate operational tasks using PowerShell, Bash, or Python. Enhance CI/CD pipelines to accelerate deployments and reduce production risks. Lead efforts to minimise toil by building self-healing and auto-scaling systems.
Observability & Incident Management: Implement robust monitoring, logging, and tracing solutions (e.g., Azure Monitor, Application Insights, Splunk). Lead incident response and postmortem reviews, identifying root causes and driving long-term fixes. Establish operational runbooks and playbooks to facilitate the rapid…

