Lead Engineer, SRE

Job in Milton Keynes, Buckinghamshire, MK1, England, UK
Listing for: The Open University UK
Part Time, Contract position
Listed on 2025-12-27
Job specializations:
  • IT/Tech
    Systems Engineer, Cloud Computing, SRE/Site Reliability
Salary/Wage Range or Industry Benchmark: 59,966 - 67,468 GBP Yearly
Job Description & How to Apply Below

Job Location: Milton Keynes, Remote/Hybrid

Salary: £59,966 to £67,468 + Market Related Pay of £6,746 per year until 31 December 2027 initially

Closing Date: 11 January 2026

Weekly Working Hours: 37

Contract Type: Permanent

Fixed Term Contract End Date: Not Applicable

Welsh Language: Not Applicable

Change your career, change lives

The Open University is the UK’s largest university, a world leader in flexible part-time education combining a mission to widen access to higher education with research excellence, transforming lives through education. Find out more about us and our mission by watching this short video (you will be taken to YouTube by clicking this link).

About the Role

As a Lead Engineer, SRE at The Open University, you will be a key driver of reliability, scalability, and operational excellence across our platforms and services. This role goes beyond traditional operations—you will help shape how we design, build, deploy, and run systems in production, applying both a software engineering mindset and deep cloud expertise.

Your primary focus will be on Microsoft Azure, where you will architect and manage resilient cloud infrastructure, implement automation at scale, and integrate observability into all our deliverables. By partnering with architects, software engineers, and product teams, you will ensure our systems meet the highest standards of performance, security, and cost efficiency while remaining easy to operate.

In this senior role, you will:

  • Drive the adoption of SRE practices such as SLIs, SLOs, error budgets, and blameless postmortems to improve system reliability.
  • Embed automation and self-healing mechanisms to reduce manual toil and accelerate recovery from failures.
  • Champion infrastructure as code (IaC) using Bicep, ensuring consistent, repeatable, and compliant environments.
  • Build out end-to-end observability, enabling proactive issue detection and actionable insights into system health.
  • Partner with engineering leadership to shape the technical roadmap, guiding investments in scalability, resilience, and DevOps culture.

This role also carries a strong mentorship and leadership component. You will coach engineers across teams, advocate for best practices, and foster a build-run-own mindset that elevates operational maturity. As part of the engineering profession, you will influence architectural decisions, guide platform evolution, and ensure that our technical direction aligns with both short-term delivery goals and long-term strategic vision.

We are seeking an individual who excels in complex, hybrid environments—encompassing on-premises, cloud-native, and multi-cloud (Azure, AWS) platforms—and can effectively balance tactical problem-solving with strategic foresight. The ideal candidate will be passionate about automation, resilience engineering, and cloud-scale operations, with the ambition to make a lasting impact on how services are delivered and operated.

Key Responsibilities
  • Reliability & Performance: Ensure critical systems and applications are highly available, fault-tolerant, and performant. Implement SLIs, SLOs, and SLAs to measure and drive service reliability. Conduct capacity planning, performance tuning, and chaos engineering exercises to validate system resilience.
  • Cloud Platform Ownership: Design, build, and manage scalable infrastructure on Azure, leveraging services such as App Services, Functions, Service Bus, Front Door, Azure SQL and Event Hub. Use Infrastructure as Code (IaC) with Bicep and Terraform to standardise deployments. Optimise cloud cost efficiency (FinOps) while ensuring stability and performance.
  • Automation & Operations: Automate operational tasks using PowerShell, Bash, or Python. Enhance CI/CD pipelines to accelerate deployments and reduce production risks. Lead efforts to minimise toil by building self-healing and auto-scaling systems.
  • Observability & Incident Management: Implement robust monitoring, logging, and tracing solutions (e.g., Azure Monitor, Application Insights, Splunk). Lead incident response and postmortem reviews, identifying root causes and driving long-term fixes. Establish operational runbooks and playbooks to facilitate the rapid…