Senior Site Reliability Engineer Job, Cambridge area, Massachusetts, USA, IT/Tech

About this position

Blitzy is a Cambridge, MA-based AI software development platform on a mission to revolutionize the software development life cycle by autonomously building custom software to unlock the next industrial revolution. We're transforming how enterprises build software, turning enterprise requirements into production-ready code with an agentic software development platform that can autonomously execute 80% of software development work.

We're backed by multiple tier-1 investors, and our founders have a proven track record with previous start-ups.

Location: Cambridge, MA — Kendall Square HQ (In-Office)

The Role

As a Senior Site Reliability Engineer at Blitzy's Kendall Square headquarters, you will be a foundational force behind the reliability, scalability, and operational excellence of our AI-powered software development platform. Sitting at the intersection of software engineering and infrastructure, you'll ensure that the systems enabling enterprise customers to autonomously build production-ready software remain performant, resilient, and always available. This is a high-ownership, high-impact role for an engineer who operates with urgency, thinks in systems, and takes pride in building infrastructure that doesn't break.

What Success Looks Like

Blitzy's platform maintains industry-leading uptime — incidents are rare, and when they occur, they are resolved quickly with clear root cause analysis and lasting fixes.
SLOs and error budgets are defined for every critical service and actively used to drive engineering decisions, not just tracked passively.
Observability is a first-class capability — engineers across the company have the dashboards, traces, and alerts they need to understand system behavior without asking SRE.
Deployment pipelines are fast, safe, and reliable — releases go out with confidence and rollbacks are automated when something goes wrong.
Infrastructure is entirely codified — no manual provisioning, no configuration drift, every environment reproducible from source.
Engineering teams are more productive because of your work — platform friction is low, developer tooling is sharp, and SRE is seen as an accelerant, not a gatekeeper.
You are a trusted technical leader at HQ, influencing how Blitzy thinks about reliability as we scale our platform and our team.

Areas of Ownership

Design, build, and operate highly available, fault-tolerant infrastructure across cloud environments supporting Blitzy's AI platform and enterprise customers.
Define and own SLOs, SLAs, and error budgets for critical services; lead blameless postmortems and drive systemic improvements that prevent recurrence.
Build and maintain robust CI/CD pipelines, release automation, and deployment infrastructure that empower engineers to ship with speed and safety.
Own the full observability stack: logging, metrics, distributed tracing, and alerting (e.g., Prometheus, Grafana, Datadog, OpenTelemetry).
Manage Kubernetes clusters and container infrastructure supporting AI agent workloads and production application services.
Drive infrastructure-as-code practices using Terraform; ensure all provisioning is automated, auditable, and version-controlled.
Partner with engineering teams at HQ to embed reliability and operational best practices early in the development lifecycle.
Lead capacity planning, performance benchmarking, and cloud cost optimization as the platform scales.

Required Experience

5–8 years of experience in Site Reliability Engineering, DevOps, or Platform Engineering.
Deep expertise in Kubernetes — cluster management, workload deployment, scaling strategies, and troubleshooting in production.
Strong proficiency with at least one major cloud platform (AWS preferred); experience designing and operating distributed, high-availability systems.
Hands-on Terraform experience for infrastructure-as-code provisioning and management.
Proven ability to define and operationalize SLOs, SLAs, and incident response processes.
Strong scripting and automation skills in Python, Go, or Bash.
Experience designing and maintaining comprehensive observability systems across complex, multi-service environments.
Excellent…