Senior Site Reliability Engineer

Job in Boston, Suffolk County, Massachusetts, 02298, USA
Listing for: RapidSOS
Full Time position
Listed on 2026-04-27
Job specializations:
  • IT/Tech
    SRE/Site Reliability, Systems Engineer, Cloud Computing, Network Engineer
Salary/Wage Range or Industry Benchmark: 160,000 - 195,000 USD per year
Job Description & How to Apply Below

At RapidSOS, we are committed to using technology to build a safer, stronger future and working together to save lives. We’re in an exciting phase of growth, welcoming new members from across the globe to our mission‑driven, ambitious, and inclusive team. Our work is founded on our values of elevating purpose, inventing tomorrow, delivering with urgency, serving with integrity, and winning together, all of which support a company culture where people can innovate, collaborate, grow, and, above all, make an impact.

RapidSOS is the leading public safety AI company that unlocks mission‑critical intelligence for first responders and security teams – enabling faster, smarter and more accurate emergency response. Real‑time data from the world’s largest safety network of 700M+ devices, 200+ global enterprises, and 23,000+ federal, state and local agencies fuels the RapidSOS HARMONY AI engine that delivers this intelligence to those who need it most. Learn more at

What this role is about

Are you excited to work on systems where reliability directly impacts real‑world outcomes? At RapidSOS, we build technology that powers emergency response, ensuring critical data gets to the right place at the right time. When these systems degrade or fail, the impact is real. Reliability isn’t a background function; it’s fundamental to how our product shows up in critical moments.

What you’ll do
  • Own performance and reliability outcomes:
    Ownership of how application‑level decisions create system‑level impact, including connection pooling, database architecture, traffic routing patterns, and memory allocation. Collaborate with engineering teams that own specific domains, partnering directly to improve reliability and performance across their systems.
  • Design for system resilience:
    Responsibility for strengthening reliability through proactive design decisions, including safer deployment patterns, failover strategies, and redundancy approaches that improve system behavior under stress.
  • Build observability into system behavior:
    Proactively instrument services with structured logging, metrics, and alerting so systems are easier to understand and debug. The focus is on creating clear signals from production behavior before issues escalate.
  • Own incidents from signal to resolution:
    Ownership of production issues from first signal through resolution, including investigation across infrastructure and application layers, root cause identification, and implementation of fixes that restore stability and strengthen system behavior long term.
  • Work across the stack without a permission slip:
    You’ll work across infrastructure‑as‑code, container orchestration, CI/CD pipelines, and service‑level application code. When issues come up, you don’t wait for a handoff; you take ownership directly and drive them through to resolution.
What we’re looking for in our ideal candidate
  • 5+ years of professional engineering experience with deep expertise in Python
  • Real cloud infrastructure experience with AWS: networking, managed databases, cost implications of traffic routing decisions, IAM, DNS‑based routing and failover
  • Hands‑on Kubernetes experience with containerized workloads in production across EKS, ECS, or Fargate: you can read events, understand resource limits, know when to drain vs. delete a node, and understand the trade‑offs between orchestration models
  • Strong understanding of distributed systems and how they fail, including resource exhaustion, replication lag, queue back pressure, and other common failure modes
  • Experience operating high‑throughput messaging systems (RabbitMQ, Kafka, AWS SNS / SQS, etc.) and the infrastructure around them, including infrastructure‑as‑code (e.g., Terraform) and CI/CD pipelines, with an emphasis on improving reliability and scalability
  • Experience building or improving observability through logging, metrics, and alerting
  • Demonstrable experience in using AI to safely and securely enhance velocity, improve reliability and recoverability of services
  • Strong communication and interpersonal skills; is a team player with a positive attitude
  • Highly self‑motivated; ability to adapt and learn quickly in a fast‑paced environment…
Position Requirements
10+ Years work experience