Senior Specialist Engineer (SRE) - UKHSA - SEO Job - Manchester area, England, UK - IT/Tech

Position: Senior Specialist Engineer (SRE) - UKHSA - SEO

We are seeking a highly motivated and experienced Site Reliability Engineer to join our High Performance Computing, Site Reliability Engineering, Artificial Intelligence (HPC/SRE/AI) and research computing unit at the UK Health Security Agency (UKHSA). The role will be based in Manchester Digital and will operate as a hybrid position across UKHSA headquarters (Birmingham, Leeds, Liverpool, London), with a minimum of 60% of contractual hours onsite.

Location & Working Arrangement

Hybrid working model: minimum 60% of contractual hours (≈3 days a week, pro rata) at one of UKHSA's core HQs (Birmingham, Leeds, Liverpool, London). Modern refurbished offices with excellent transport links. Shared spaces for collaboration with other government departments, including DHSC.

About The Job

The Digital and Data Directorate provides scientific and research computing services. The Digital Development and Operations unit delivers platforms and technical capabilities to enable public health services within the organisation and with clients and stakeholders.

Key Responsibilities

Remediate infrastructure and operational problems.
Leverage automation and CI/CD to ensure reliable, scalable, and high‑performance services.
Monitor and manage cloud infrastructure services, and observe systems to prioritise operational and performance improvements that meet or exceed SLOs.
Architect, develop and manage multi‑cloud HPC platforms and on‑premises infrastructure.
Ensure services are highly available, scalable, and resilient.
Manage performance, capacity planning, and support UKHSA's AI requirements.

Incident Response & Troubleshooting

Respond swiftly to production incidents with minimal downtime and rapid restoration.
Perform root cause analysis and post‑mortems to implement lessons learned.

Monitoring, Alerting & Observability

Design and implement effective monitoring and alerting systems using Prometheus, Grafana, etc.
Improve observability to identify issues before they impact users.
Continuously refine practices to reduce alert fatigue.

Automation & Tooling

Develop automation to eliminate manual repetitive tasks and improve efficiency.
Write clean, maintainable, well‑tested code for automation and tooling.
Drive initiatives to reduce operational toil via Infrastructure as Code.

Service Level Objectives & Operational Improvements

Define, track, and improve SLOs, SLIs, and error budgets.
Prioritise improvements that align with business goals and user experience.

SRE Best Practices & Advocacy

Evangelise SRE principles across the organisation.
Integrate reliability practices into the development lifecycle.

Collaboration & Knowledge Sharing

Collaborate with software engineering, DevOps, and infrastructure teams.
Promote a culture of shared responsibility for service reliability.

Documentation & Training

Maintain accurate technical documents, runbooks, and post‑incident reports.
Provide training and mentorship on best practices and tools.

Essential Criteria

Experience as a Site Reliability Engineer, DevOps Engineer, Operations Engineer or similar.
Programming/scripting skills in Python, PowerShell, Bash.
Understanding of Linux/Unix, Windows, networking, and distributed systems.
Experience with observability tools (Prometheus, Grafana, Datadog) and alerting systems.
Infrastructure automation skills (Terraform, Ansible, Helm).
Excellent communication and collaboration skills.
Experience with security best practices.
Strong problem‑solving skills and ability to respond to sudden demands.

Desirable Criteria

CI/CD pipelines, cloud platforms (AWS, GCP, Azure), and Kubernetes experience.
Post‑incident review experience.
Driving SRE practice adoption across an organisation.
Delivering training or mentoring of junior engineers.

Seniority Level

Mid‑Senior level

Employment Type

Full‑time

Job Function

Engineering and Information Technology

Industries

Technology, Information and Internet
