Lead Site Reliability Engineer Job, Stratford-upon-Avon area, England, UK, IT/Tech

Overview

Modern tech-stack. Hybrid infrastructure. Reliability for 4,000+ users.

Lead Site Reliability Engineer

£64,000 - £74,000 (+ Benefits)

Grade: P3MP

Reports to: Senior Manager, Platform Engineering

Contract: Permanent

Hours: Full time, 35 hours per week

Location: Stratford, London. Office-based with high flexibility (1-2 days per week in the office)

Visa sponsorship: Cancer Research UK can consider visa sponsorship for this vacancy. If this applies to you, please ensure that this is clearly marked on your application.

Closing date: 16 February 2026 23:55

This vacancy may close earlier if a high volume of applications is received or once a suitable candidate is found, so we strongly recommend that you apply early to avoid disappointment. If you require more time to apply as part of a reasonable adjustment, please contact

Recruitment process: Telephone interview followed by two competency-based interviews

Interview date: From the week commencing 23 February 2026

How do I apply? We operate an anonymised shortlisting process as part of our commitment to equality, diversity, and inclusion. CVs are required for all applications, but we won’t be able to view them until we invite you for an interview. Instead, we ask you to fully complete the work history section of the online application form so that we can assess you quickly, fairly, and objectively.

At Cancer Research UK, we exist to beat cancer. We are professionals with purpose, beating cancer every day. But we need to go much further and much faster. That’s why we’re looking for someone talented, someone who wants to develop their skills, someone like you.

Cancer Research UK has an ambitious Engineering Strategy supported by a modern Tech Stack and a complex hybrid infrastructure spanning on-premise and multi-cloud environments.

As a Lead Site Reliability Engineer, you’ll play a vital role in shaping and advancing SRE practices across the charity. You’ll lead incident response, drive automation to reduce operational toil, and act as the escalation point for complex production issues. You’ll define meaningful Service Level Objectives, strengthen observability, and help foster a blameless, learning-focused culture that continually improves reliability.

You’ll also lead and develop a team of Site Reliability Engineers, balancing day-to-day operational needs with engineering work that delivers long-term improvements. Working closely with development teams and Platform Engineering colleagues, you’ll embed SRE principles across our services, coaching engineers and influencing technical direction to ensure reliability is built in from the start.

If you’re an SRE leader who has strengthened large-scale production systems across complex on-premise and AWS environments, and you’re passionate about developing and leading teams to drive meaningful change, we would love for you to join our mission.

What will I be doing?

- Ensuring the reliability, availability, and performance of Cancer Research UK’s production services across AWS, on-premise, and data centre environments. This includes:
  - Defining and monitoring Service Level Objectives (SLOs), error budgets, and reliability metrics.
  - Reducing incidents and operational toil through automation, engineering improvements, and continuous optimisation.
- Leading incident response, promoting a blameless culture, coordinating cross-team response, and ensuring post-mortem and follow-up actions drive long-term improvement.
- Building and maintaining comprehensive monitoring, logging, alerting, and tracing capabilities.
- Creating tools and dashboards that give teams clear visibility into system health, performance, and reliability and help them proactively identify issues.
- Collaborating closely with development teams, architects, and Platform Engineering colleagues to embed reliability, observability, and operability into service design.
- Advising on scalability, performance, capacity planning, and production readiness at scale.
- Driving automation and toil reduction through infrastructure as code, robust CI/CD pipelines, self-service tooling, and the removal of manual operational tasks.
- Collaborating with the Head of Platform Engineering and…