Senior Site Reliability Engineer

Job in Cambridge, Middlesex County, Massachusetts, 02140, USA
Listing for: Blitzy
Full Time position
Listed on 2026-05-08
Job specializations:
  • IT/Tech
    Systems Engineer, SRE/Site Reliability, Cloud Computing, IT Support
Salary/Wage Range or Industry Benchmark: 80,000 - 100,000 USD yearly
About this position

Blitzy is a Cambridge, MA-based AI software development platform on a mission to revolutionize the software development life cycle by autonomously building custom software to unlock the next industrial revolution. We're transforming how enterprises build software, turning enterprise requirements into production-ready code with an agentic software development platform that can autonomously execute 80% of software development work.

We're backed by multiple tier-1 investors, and our founders have a proven record of success at previous start-ups.

Location: Cambridge, MA — Kendall Square HQ (In-Office)

The Role

As a Senior Site Reliability Engineer at Blitzy's Kendall Square headquarters, you will be a foundational force behind the reliability, scalability, and operational excellence of our AI-powered software development platform. Sitting at the intersection of software engineering and infrastructure, you'll ensure that the systems enabling enterprise customers to autonomously build production-ready software remain performant, resilient, and always available. This is a high-ownership, high-impact role for an engineer who operates with urgency, thinks in systems, and takes pride in building infrastructure that doesn't break.

What Success Looks Like
  • Blitzy's platform maintains industry-leading uptime — incidents are rare, and when they occur, they are resolved quickly with clear root cause analysis and lasting fixes.
  • SLOs and error budgets are defined for every critical service and actively used to drive engineering decisions, not just tracked passively.
  • Observability is a first-class capability — engineers across the company have the dashboards, traces, and alerts they need to understand system behavior without asking SRE.
  • Deployment pipelines are fast, safe, and reliable — releases go out with confidence and rollbacks are automated when something goes wrong.
  • Infrastructure is entirely codified — no manual provisioning, no configuration drift, every environment reproducible from source.
  • Engineering teams are more productive because of your work — platform friction is low, developer tooling is sharp, and SRE is seen as an accelerant, not a gatekeeper.
  • You are a trusted technical leader at HQ, influencing how Blitzy thinks about reliability as we scale our platform and our team.
Areas of Ownership
  • Design, build, and operate highly available, fault-tolerant infrastructure across cloud environments supporting Blitzy's AI platform and enterprise customers.
  • Define and own SLOs, SLAs, and error budgets for critical services; lead blameless postmortems and drive systemic improvements that prevent recurrence.
  • Build and maintain robust CI/CD pipelines, release automation, and deployment infrastructure that empower engineers to ship with speed and safety.
  • Own the full observability stack: logging, metrics, distributed tracing, and alerting (e.g., Prometheus, Grafana, Datadog, OpenTelemetry).
  • Manage Kubernetes clusters and container infrastructure supporting AI agent workloads and production application services.
  • Drive infrastructure-as-code practices using Terraform; ensure all provisioning is automated, auditable, and version-controlled.
  • Partner with engineering teams at HQ to embed reliability and operational best practices early in the development lifecycle.
  • Lead capacity planning, performance benchmarking, and cloud cost optimization as the platform scales.
Required Experience
  • 5–8 years of experience in Site Reliability Engineering, DevOps, or Platform Engineering.
  • Deep expertise in Kubernetes — cluster management, workload deployment, scaling strategies, and troubleshooting in production.
  • Strong proficiency with at least one major cloud platform (AWS preferred); experience designing and operating distributed, high-availability systems.
  • Hands-on Terraform experience for infrastructure-as-code provisioning and management.
  • Proven ability to define and operationalize SLOs, SLAs, and incident response processes.
  • Strong scripting and automation skills in Python, Go, or Bash.
  • Experience designing and maintaining comprehensive observability systems across complex, multi-service environments.
  • Excellent…
Position Requirements
10+ Years work experience