Site Reliability Engineering Manager
Job in Boston, Suffolk County, Massachusetts, 02298, USA
Listed on 2026-06-03
Listing for: RapidSOS
Full Time position
Job specializations:
- IT/Tech
Systems Engineer, SRE/Site Reliability, Cloud Computing, IT Support
Job Description & How to Apply Below
RapidSOS will have handled ~1,380 emergencies.
At RapidSOS, we are committed to using technology to build a safer, stronger future and working together to save lives. We're in an exciting phase of growth, welcoming new members from across the globe to our mission-driven, ambitious, and inclusive team. Our work is founded on our values of elevating purpose, inventing tomorrow, delivering with urgency, serving with integrity, and winning together, all of which support a company culture where people can innovate, collaborate, grow, and, above all, make an impact.
RapidSOS is the leading public safety AI company that unlocks mission-critical intelligence for first responders and security teams, enabling faster, smarter, and more accurate emergency response. Real-time data from the world's largest safety network of 700M+ devices, 200+ global enterprises, and 23,000+ federal, state and local agencies fuels the RapidSOS HARMONY AI engine that delivers this intelligence to those who need it most. Learn more at
What this role is about:
This is an engineering leadership role, not simply on-call management. The SRE Manager owns two things: keeping RapidSOS's cloud infrastructure running reliably, and helping product teams get to a place where they can run their own services without routing every operational issue through SRE. RapidSOS powers real-time emergency response by connecting life-critical data to first responders, so reliability here directly impacts outcomes in moments that matter.
You'll lead the SRE Operations team and report to the Director of SRE & Platform Engineering. The team has real roots in NOC-style operations, and the honest goal of this role is to move it toward something more engineering-focused and proactive: better tooling, better practices, more ownership at the service team level. That's a gradual transition, and you'll be the one shaping how it happens.
What you'll do:
* Own the reliability, scalability, and operational health of RapidSOS's Kubernetes clusters, shared services, and core AWS infrastructure; this covers upgrades, capacity planning, node scaling, and testing that multi-region failover actually works
* Drive the IaC foundation in Terraform/Atlantis and champion infrastructure-as-code as a core engineering standard
* Partner with Engineering Managers to set SLOs for their services, establish error budgets, and help teams build the habits to operate what they ship; the goal is for product teams to own their services, not to have SRE own everything on their behalf
* Maintain proactive reliability work: capacity planning, failure mode analysis, runbook quality, and chaos engineering exercises; run reliability reviews before major launches and organize failure mode exercises with product teams
* Drive blameless postmortem practice, ensuring every significant incident produces systemic improvements with clear ownership and closure
* Run the Tier 1 on-call rotation: scheduling for primary and secondary engineers, coordination with the 3rd-party NOC, and keeping incident escalation processes smooth and manageable
* Lead incident command on Sev-1s, escalate when needed, and keep engineering leadership informed throughout
* Lead and grow a high-impact team by mentoring engineers, owning headcount, and thinking ahead about what the team needs as the function grows
* Shape the team's long-term AI strategy for infrastructure and operations by identifying opportunities for AI-driven automation and insight generation, evaluating tooling and workflows, and operationalizing best practices for scalable team-wide usage
* Own the reserved instance strategy, the team's AWS cost footprint, and error budgets and SLOs across production services, and communicate that picture clearly to engineering and product leadership
* Work alongside Platform SRE on bigger infrastructure projects: Gateway API adoption, cross-region architecture, security changes
What we're looking for in our ideal candidate:
* 7+ years in SRE, platform engineering, or DevOps, including at least two years managing a team
* You've been directly responsible for…