
Senior Specialist Engineer (SRE) - UKHSA - SEO

Job in Manchester, Greater Manchester, M9, England, UK
Listing for: Manchester Digital
Part Time position
Listed on 2025-12-14
Job specializations:
  • IT/Tech
    Systems Engineer, Cloud Computing, IT Support, SRE/Site Reliability
Job Description & How to Apply Below
Position: Senior Specialist Engineer (SRE) - UKHSA - SEO

We are seeking a highly motivated and experienced Site Reliability Engineer to join our High Performance Computing, Site Reliability Engineering, Artificial Intelligence (HPC/SRE/AI) and research computing unit at the UK Health Security Agency (UKHSA). The role will be based in Manchester Digital and will operate as a hybrid position across UKHSA headquarters (Birmingham, Leeds, Liverpool, London), with a minimum of 60% of contractual hours onsite.

Location & Working Arrangement

Hybrid working model: a minimum of 60% of contractual hours (≈3 days a week, pro rata) at one of UKHSA's core HQs (Birmingham, Leeds, Liverpool, London). The offices are modern and refurbished, with excellent transport links and shared space for collaboration with other government departments, including DHSC.

About The Job

The Digital and Data Directorate provides scientific and research computing services. The Digital Development and Operations unit delivers platforms and technical capabilities to enable public health services within the organisation and with clients and stakeholders.

Key Responsibilities
  • Remediate infrastructure and operational problems.
  • Leverage automation and CI/CD to ensure reliable, scalable, and high‑performance services.
  • Monitor and manage cloud infrastructure services, and observe systems to prioritise operational and performance improvements that meet or exceed SLOs.
  • Architect, develop, and manage multi‑cloud HPC platforms and on‑premises infrastructure.
  • Ensure services are highly available, scalable, and resilient.
  • Manage performance, capacity planning, and support UKHSA's AI requirements.
Incident Response & Troubleshooting
  • Respond swiftly to production incidents with minimal downtime and rapid restoration.
  • Perform root cause analysis and post‑mortems to implement lessons learned.
Monitoring, Alerting & Observability
  • Design and implement effective monitoring and alerting systems using Prometheus, Grafana, etc.
  • Improve observability to identify issues before they impact users.
  • Continuously refine practices to reduce alert fatigue.
Automation & Tooling
  • Develop automation to eliminate manual repetitive tasks and improve efficiency.
  • Write clean, maintainable, well‑tested code for automation and tooling.
  • Drive initiatives to reduce operational toil via Infrastructure as Code.
Service Level Objectives & Operational Improvements
  • Define, track, and improve SLOs, SLIs, and error budgets.
  • Prioritise improvements that align with business goals and user experience.
SRE Best Practices & Advocacy
  • Evangelize SRE principles across the organisation.
  • Integrate reliability practices into the development lifecycle.
Collaboration & Knowledge Sharing
  • Collaborate with software engineering, DevOps, and infrastructure teams.
  • Promote a culture of shared responsibility for service reliability.
Documentation & Training
  • Maintain accurate technical documents, runbooks, and post‑incident reports.
  • Provide training and mentorship on best practices and tools.
Essential Criteria
  • Experience as a Site Reliability Engineer, DevOps Engineer, Operations Engineer, or similar.
  • Programming/scripting skills in Python, PowerShell, and Bash.
  • Understanding of Linux/Unix, Windows, networking, and distributed systems.
  • Experience with observability tools (Prometheus, Grafana, Datadog) and alerting systems.
  • Infrastructure automation skills (Terraform, Ansible, Helm).
  • Excellent communication and collaboration skills.
  • Experience with security best practices.
  • Strong problem‑solving skills and ability to respond to sudden demands.
Desirable Criteria
  • CI/CD pipelines, cloud platforms (AWS, GCP, Azure), and Kubernetes experience.
  • Post‑incident review experience.
  • Driving SRE practice adoption across an organisation.
  • Delivering training or mentoring of junior engineers.
Seniority Level

Mid‑Senior level

Employment Type

Full‑time

Job Function

Engineering and Information Technology

Industries

Technology, Information and Internet

Position Requirements
10+ years of work experience