Senior Specialist Engineer (SRE) - UKHSA - SEO
Listed on 2025-12-14
IT/Tech
Systems Engineer, Cloud Computing, IT Support, SRE/Site Reliability
We are seeking a highly motivated and experienced Site Reliability Engineer to join our High Performance Computing, Site Reliability Engineering, Artificial Intelligence (HPC/SRE/AI) & research computing unit at UK Health Security Agency (UKHSA). The role will be based in Manchester Digital and will operate as a hybrid position across UKHSA headquarters (Birmingham, Leeds, Liverpool, London) with a minimum of 60% onsite.
Location & Working Arrangement
Hybrid working model: minimum 60% contractual hours (≈3 days a week pro rata) at one of UKHSA's core HQs (Birmingham, Leeds, Liverpool, London). Modern refurbished offices with excellent transport links. Public space collaboration with other government departments including DHSC.
About The Job
The Digital and Data Directorate provides scientific and research computing services. The Digital Development and Operations unit delivers platforms and technical capabilities to enable public health services within the organisation and with clients and stakeholders.
Key Responsibilities
- Remediate infrastructure and operational problems.
- Leverage automation and CI/CD to ensure reliable, scalable, and high‑performance services.
- Monitor and manage cloud infrastructure services, and observe systems to prioritise operational and performance improvements that meet or exceed SLOs.
- Architect, develop & manage multi‑cloud HPC platforms and on‑premise infrastructure.
- Ensure services are highly available, scalable, and resilient.
- Manage performance, capacity planning, and support UKHSA's AI requirements.
- Respond swiftly to production incidents with minimal downtime and rapid restoration.
- Perform root cause analysis and post‑mortems to implement lessons learned.
- Design and implement effective monitoring and alerting systems using Prometheus, Grafana, etc.
- Improve observability to identify issues before they impact users.
- Continuously refine practices to reduce alert fatigue.
- Develop automation to eliminate manual repetitive tasks and improve efficiency.
- Write clean, maintainable, well‑tested code for automation and tooling.
- Drive initiatives to reduce operational toil via Infrastructure as Code.
- Define, track, and improve SLOs, SLIs, and error budgets.
- Prioritize improvements aligning with business goals & user experience.
- Evangelize SRE principles across the organisation.
- Integrate reliability practices into the development lifecycle.
- Collaborate with software engineering, DevOps, and infrastructure teams.
- Promote a culture of shared responsibility for service reliability.
- Maintain accurate technical documents, runbooks, and post‑incident reports.
- Provide training and mentorship on best practices and tools.
Requirements
- Experience as a Site Reliability Engineer, DevOps Engineer, Operations Engineer or similar.
- Programming/scripting skills in Python, PowerShell, Bash.
- Understanding of Linux/Unix, Windows, networking, distributed systems.
- Experience with observability tools (Prometheus, Grafana, Datadog) and alerting systems.
- Infrastructure automation skills (Terraform, Ansible, Helm).
- Excellent communication and collaboration skills.
- Experience with security best practices.
- Strong problem‑solving skills and ability to respond to sudden demands.
- CI/CD pipelines, cloud platforms (AWS, GCP, Azure), and Kubernetes experience.
- Post‑incident review experience.
- Driving SRE practice adoption across an organisation.
- Delivering training or mentoring of junior engineers.
Mid‑Senior level
Employment Type: Full‑time
Job Function: Engineering and Information Technology
Industries:
Technology, Information and Internet