Site Reliability Engineer

Job in Greater London, London, Greater London, W1B, England, UK
Listing for: black.ai
Full Time, Per diem position
Listed on 2026-02-12
Job specializations:
  • IT/Tech
    SRE/Site Reliability, Cloud Computing, Systems Engineer, IT Support
Salary/Wage Range or Industry Benchmark: 100000 - 125000 GBP Yearly
Job Description & How to Apply Below
Location: Greater London

Location

London

Employment Type

Full time

Location Type

Hybrid

Department

Engineering

Who We Are

Healthcare needs a better rhythm: one that keeps care continuous and deeply human. Heidi is building an AI Care Partner that works alongside clinicians to make that possible.

We’re a team of doctors, engineers, designers, researchers, and creatives building tools that help clinicians stay focused on what matters most: their patients.

In just 18 months, Heidi has given back more than 18 million hours to healthcare professionals - supporting 73 million patient visits in 116 countries. Today, more than two million patient visits each week are powered by Heidi worldwide.

Backed by nearly $100 million in funding, we’re growing in the US, UK, Canada, and Europe, partnering with leading health systems including the NHS, Beth Israel Lahey Health, and Monash Health.

The Role

This role sits in the core Platform/SRE team that owns production. You’ll work directly on incident response, on-call, system reliability, and day-to-day operations for Heidi’s platform.

We’re open to candidates who are strong mid-level SREs ready to take on more ownership, as well as senior SREs who enjoy being hands-on in operations. The role is intentionally ops-heavy and focused on keeping real systems healthy in production.

What you’ll do
  • Participate in on-call and incident response: Respond to production incidents, contribute to service restoration, and support clear communication during incidents. Over time, take increasing responsibility for leading incidents end-to-end.

  • Improve operational reliability: Identify recurring issues and reliability risks, and drive fixes through better alerting, automation, system changes, or process improvements.

  • Own parts of the production environment: Operate and improve Kubernetes clusters, cloud infrastructure, and core platform services, with growing ownership as familiarity increases.

  • Strengthen observability: Improve dashboards, alerts, logs, and traces so issues are detected earlier and diagnosed faster, with a strong focus on actionable signals.

  • Reduce operational toil: Automate repetitive tasks, simplify runbooks, and improve tooling to make on-call and day-to-day operations easier and safer.

  • Support safe change: Improve deployments, rollback mechanisms, and operational readiness to reduce the risk of incidents caused by change.

  • Contribute to operational practices: Write and maintain runbooks, participate in blameless post-mortems, and help improve incident response processes over time.

  • Collaborate closely with engineers: Work with product and feature teams to improve production readiness, service ownership, and reliability expectations.

What we’re looking for
  • 3–6+ years in SRE, DevOps, Platform, or operations-heavy engineering roles.

  • Experience supporting production systems and participating in on-call rotations.

  • Comfortable debugging live systems under pressure.

  • Experience operating cloud infrastructure (AWS preferred).

  • Working knowledge of Kubernetes and containerised workloads.

  • Infrastructure as Code experience (Terraform or similar).

  • Familiarity with monitoring and alerting tools (Datadog, Prometheus, etc.).

  • Scripting or automation experience (Python, Bash, or similar).

Nice to have:

  • Experience leading incidents or mentoring others during on-call.

  • Experience in regulated or security-sensitive environments.

  • Familiarity with databases, queues, and caches in production.

  • Interest in reliability practices such as SLOs, error budgets, and capacity planning.

How We Work
  • We own production: The Platform/SRE team is responsible for reliability and incident response.

  • Incidents are blameless: We focus on learning and improving systems, not assigning fault.

  • Practical over perfect: We prioritise improvements that reduce real operational pain.

  • Calm under pressure: Clear thinking and communication matter during incidents.

What do we believe in?

Heidi builds for the future of healthcare, not just the next quarter, and our goals are ambitious because the world’s health demands it. We believe in progress built through precision, pace, and ownership.

  • Live Forever - Every release moves care forward: measured, safe, and built to last. Data…
