Site Reliability Engineer Job, London area, England, UK, IT/Tech

Position: Staff Site Reliability Engineer

Overview

We’re looking for a Staff Site Reliability Engineer (SRE) to raise the reliability, scalability, and security bar across the Lyrebird platform. This is a senior, high-impact role focused on designing and evolving the systems and practices that keep Lyrebird fast, safe, and available. You’ll work across infrastructure, application reliability, observability, incident response, and platform enablement - partnering closely with Engineering, Security, and Product.

This is not a “keep the lights on” role. You’ll drive meaningful improvements to how we build, deploy, and operate our services in production - with real autonomy and ownership.

About Lyrebird Health
Lyrebird Health is transforming the quality and accessibility of healthcare by automating clinicians’ most time-consuming tasks. Thousands of clinicians across many disciplines already use Lyrebird — and that number is growing every day. They trust us to deliver a fast, reliable, and secure experience. We value that trust above all else and strive to earn it while continuing to amaze our users.

What You’ll Do

Reliability & Production Engineering
- Own reliability outcomes across core services and customer-facing systems
- Define, implement, and evolve SLOs/SLIs, alerting strategy, and error budgets
- Lead initiatives to improve uptime, latency, and overall system resilience
- Proactively identify reliability risks and drive mitigation plans to completion
Observability & Incident Response
- Improve end-to-end observability (metrics, logs, traces) so issues are detected early and diagnosed quickly
- Lead incident response for high-severity events and guide teams through calm, effective mitigation
- Drive post-incident reviews that result in measurable, lasting improvements
- Build a culture of operational excellence: fewer incidents, faster recovery, better learning
Platform Enablement
- Develop internal tooling and paved paths that make “doing the right thing” the easiest option
- Improve the developer experience around deployments, rollbacks, environment consistency, and service ownership
- Partner with engineers to uplift production-readiness across new and existing services
Infrastructure & Automation
- Improve infrastructure reliability and maintainability using Infrastructure as Code
- Strengthen deployment workflows and reduce operational toil through automation
- Help shape architecture decisions with a reliability and scalability lens
Security & Compliance Support
- Embed security and compliance principles into platform practices (access controls, auditability, safe-by-default designs)
- Work closely with Security and Engineering leadership to support regulatory and enterprise requirements without slowing down delivery

What We’re Looking For

- 8+ years of engineering experience, with significant depth in SRE / platform / production systems
- Strong experience operating and improving systems in production (including incident response)
- Proven ability to lead cross-team initiatives and influence engineering standards
Technical Strength

You don’t need to tick every box, but you should be strong across most:

Cloud/Infrastructure
- AWS (ECS, EC2, VPC, IAM, RDS/Aurora, S3, CloudWatch)
- Infrastructure as Code (Terraform)
Observability
- Strong grasp of monitoring and alerting principles
- Experience with logs, metrics, and tracing, and with building meaningful dashboards
- Familiarity with OpenTelemetry and modern observability tooling
Systems & Operational Excellence
- Knowledge of reliability patterns: graceful degradation, retries, backoff, timeouts, load shedding, capacity planning
- Strong debugging instincts across distributed systems
- Practical approach to risk management and tradeoffs
Software Engineering
- Ability to build tools and automation (TypeScript, Go, Python, or similar)
- Familiarity with CI/CD and safe rollout strategies (feature flags, canary, blue/green)

Bonus Skills (Nice to Have)

- Experience supporting security frameworks (SOC 2, ISO 27001, HIPAA-style environments)
- Experience with service mesh patterns, multi-account AWS environments, or multi-region design
- Experience working with healthcare or regulated domains
- Experience scaling engineering org practices as the company grows

Who You Are

You’re deeply accountable -…

