Site Reliability Engineering Manager Job, Boston area, Massachusetts, USA, IT/Tech

In the time it takes you to read this job description, RapidSOS will have handled ~1,380 emergencies.

At RapidSOS, we are committed to using technology to build a safer, stronger future and working together to save lives. We're in an exciting phase of growth, welcoming new members from across the globe to our mission-driven, ambitious, and inclusive team. Our work is founded on our values of elevating purpose, inventing tomorrow, delivering with urgency, serving with integrity, and winning together, all of which support a company culture where people can innovate, collaborate, grow, and, above all, make an impact.

RapidSOS is the leading public safety AI company that unlocks mission-critical intelligence for first responders and security teams - enabling faster, smarter and more accurate emergency response. Real-time data from the world's largest safety network of 700M+ devices, 200+ global enterprises, and 23,000+ federal, state and local agencies fuels the RapidSOS HARMONY AI engine that delivers this intelligence to those who need it most. Learn more at

What this role is about:

This is an engineering leadership role, not simply an on-call manager. The SRE Manager owns two things: keeping RapidSOS's cloud infrastructure running reliably, and helping product teams get to a place where they can run their own services without routing every operational issue through SRE. RapidSOS powers real-time emergency response by connecting life-critical data to first responders, so reliability here directly impacts outcomes in moments that matter.

You'll lead the SRE Operations team and report to the Director of SRE & Platform Engineering. The team has real roots in NOC-style operations, and the honest goal of this role is to move it toward something more engineering-focused and proactive: better tooling, better practices, more ownership at the service team level. That's a gradual transition, and you'll be the one shaping how it happens.

What you'll do:

* Own the reliability, scalability, and operational health of RapidSOS Kubernetes clusters, shared services, and core AWS infrastructure; handle upgrades, capacity planning, and node scaling, and test that multi-region failover actually works

* Drive the IaC foundation in Terraform/Atlantis and champion infrastructure-as-code as a core engineering standard

* Partner with Engineering Managers to set SLOs for their services, establish error budgets, and help teams build the habits to operate what they ship; the goal is for product teams to own their services, not to have SRE own everything on their behalf

* Maintain proactive reliability work: capacity planning, failure mode analysis, runbook quality, and chaos engineering exercises; run reliability reviews before major launches and organize failure mode exercises with product teams

* Drive blameless postmortem practice, ensuring every significant incident produces systemic improvements with clear ownership and closure

* Run the Tier 1 on-call rotation: scheduling for primary and secondary engineers, coordination with the 3rd-party NOC, and keeping incident escalation processes smooth and manageable

* Lead incident command on Sev-1s, escalate when needed, and keep engineering leadership informed throughout

* Lead and grow a high-impact team by mentoring engineers, owning headcount, and thinking ahead about what the team needs as the function grows

* Shape the team's long-term AI strategy for infrastructure and operations by identifying opportunities for AI-driven automation and insight generation, evaluating tooling and workflows, and operationalizing best practices for scalable team-wide usage

* Own reserved instance strategy, the team's AWS cost footprint, and error budgets and SLOs across production services, and communicate that picture clearly to engineering and product leadership

* Work alongside Platform SRE on bigger infrastructure projects: Gateway API adoption, cross-region architecture, security changes

What we're looking for in our ideal candidate:

* 7+ years in SRE, platform engineering, or DevOps, with at least two years where you were responsible for a team as a manager

* You've been directly responsible for…