Site Reliability Engineer
Listed on 2026-02-16
IT/Tech
Systems Engineer, SRE/Site Reliability
Overview
We’re looking for a Staff Site Reliability Engineer (SRE) to raise the reliability, scalability, and security bar across the Lyrebird platform. This is a senior, high-impact role focused on designing and evolving the systems and practices that keep Lyrebird fast, safe, and available. You’ll work across infrastructure, application reliability, observability, incident response, and platform enablement, partnering closely with Engineering, Security, and Product.
This is not a “keep the lights on” role. You’ll drive meaningful improvements to how we build, deploy, and operate our services in production, with real autonomy and ownership.
About Lyrebird Health
Lyrebird Health is transforming the quality and accessibility of healthcare by automating clinicians’ most time-consuming tasks. Thousands of clinicians across many disciplines already use Lyrebird — and that number is growing every day. They trust us to deliver a fast, reliable, and secure experience. We value that trust above all else and strive to earn it while continuing to amaze our users.
What You’ll Do
- Reliability & Production Engineering
  - Own reliability outcomes across core services and customer-facing systems
  - Define, implement, and evolve SLOs/SLIs, alerting strategy, and error budgets
  - Lead initiatives to improve uptime, latency, and overall system resilience
  - Proactively identify reliability risks and drive mitigation plans to completion
- Observability & Incident Response
  - Improve end-to-end observability (metrics, logs, traces) so issues are detected early and diagnosed quickly
  - Lead incident response for high-severity events and guide teams through calm, effective mitigation
  - Drive post-incident reviews that result in measurable, lasting improvements
  - Build a culture of operational excellence: fewer incidents, faster recovery, better learning
- Platform Enablement
  - Develop internal tooling and paved paths that make “doing the right thing” the easiest option
  - Improve the developer experience around deployments, rollbacks, environment consistency, and service ownership
  - Partner with engineers to uplift production-readiness across new and existing services
- Infrastructure & Automation
  - Improve infrastructure reliability and maintainability using Infrastructure as Code
  - Strengthen deployment workflows and reduce operational toil through automation
  - Help shape architecture decisions with a reliability and scalability lens
- Security & Compliance Support
  - Embed security and compliance principles into platform practices (access controls, auditability, safe-by-default designs)
  - Work closely with Security and Engineering leadership to support regulatory and enterprise requirements without slowing down delivery
- 8+ years of engineering experience, with significant depth in SRE/platform/production systems
- Strong experience operating and improving systems in production (including incident response)
- Proven ability to lead cross-team initiatives and influence engineering standards
Technical Strength
You don’t need to tick every box, but you should be strong across most:
- Cloud/Infrastructure
  - AWS (ECS, EC2, VPC, IAM, RDS/Aurora, S3, CloudWatch)
  - Infrastructure as Code (Terraform)
- Observability
  - Strong grasp of monitoring and alerting principles
  - Experience with logs, metrics, and tracing, and with building meaningful dashboards
  - Familiarity with OpenTelemetry and modern observability tooling
- Systems & Operational Excellence
  - Knowledge of reliability patterns: graceful degradation, retries, backoff, timeouts, load shedding, capacity planning
  - Strong debugging instincts across distributed systems
  - Practical approach to risk management and tradeoffs
- Software Engineering
  - Ability to build tools and automation (TypeScript, Go, Python, or similar)
  - Familiarity with CI/CD and safe rollout strategies (feature flags, canary, blue/green)
- Experience supporting security frameworks (SOC 2, ISO 27001, HIPAA-style environments)
- Experience with service mesh patterns, multi-account AWS environments, or multi-region design
- Experience working with healthcare or regulated domains
- Experience scaling engineering org practices as the company grows
- You’re deeply accountable -…