Senior Site Reliability Engineer
Listed on 2026-06-05
IT/Tech
Cloud Computing, SRE/Site Reliability
About Blitzy
Blitzy is a Cambridge, MA-based AI software development platform on a mission to revolutionize the software development life cycle by autonomously building custom software to unlock the next industrial revolution. We're transforming how enterprises build software, turning enterprise requirements into production-ready code with an agentic software development platform that can autonomously execute 80% of software development work.
We're backed by multiple tier 1 investors, and have proven success as founders of previous start-ups.
Location: Cambridge, MA (In-Office)
Compensation: $160,000 - $180,000 + equity eligibility based on performance
The Role
As a Senior Site Reliability Engineer at Blitzy's Pune headquarters, you will be the backbone of our platform's reliability, scalability, and operational excellence. You'll work at the intersection of software engineering and infrastructure, ensuring our AI-powered development platform remains highly available and performant as we scale rapidly. This is a high-impact, hands‑on role for an engineer who thrives in a fast‑moving environment and takes deep ownership of the systems they build.
What Success Looks Like
- In 30 days: You have a deep understanding of Blitzy's infrastructure architecture, have identified key reliability risks, and are actively contributing to on‑call rotations.
- In 90 days: You have shipped meaningful improvements to observability, incident response workflows, and deployment pipelines that measurably reduce MTTR and increase system uptime.
- In 6 months: You have driven at least one major reliability initiative from inception to production, established SLO/SLA frameworks for critical services, and are a trusted technical voice shaping our infrastructure roadmap.
- Design, build, and operate scalable, fault‑tolerant infrastructure across cloud environments (AWS, GCP, or Azure).
- Define and enforce SLOs, SLAs, and error budgets; lead blameless postmortems and drive systemic improvements.
- Build and maintain robust CI/CD pipelines, release automation, and deployment infrastructure.
- Own observability: design and maintain logging, metrics, tracing, and alerting stacks (e.g., Prometheus, Grafana, Datadog, OpenTelemetry).
- Partner closely with software engineering teams to embed reliability practices into the development lifecycle.
- Drive capacity planning, performance benchmarking, and cost optimization across our infrastructure.
- Champion security best practices within the infrastructure and deployment layers.
- 5+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering roles.
- Strong proficiency in at least one major cloud platform (AWS preferred); experience with Kubernetes and container orchestration at scale.
- Hands‑on experience with infrastructure‑as‑code tools (Terraform, Pulumi, or equivalent).
- Proven track record designing and maintaining high‑availability, distributed systems.
- Deep expertise in observability tooling, incident management, and on‑call practices.
- Strong scripting and automation skills (Python, Go, Bash, or similar).
- Excellent communication skills with the ability to collaborate across engineering teams and present technical findings to leadership.
- Experience supporting AI/ML workloads or GPU‑accelerated infrastructure.
- Prior experience in a high‑growth startup environment where you wore multiple hats.
- Familiarity with eBPF, service mesh technologies (Istio, Linkerd), or advanced networking.
- Contributions to open‑source SRE/DevOps tooling or communities.
- Experience building global, multi‑region infrastructure with strict latency and availability requirements.
You won't be maintaining legacy systems or fighting fires in a sprawling monolith. At Blitzy, you're building reliability into a greenfield AI platform that is redefining how the world creates software. You'll have direct influence over architectural decisions, work side‑by‑side with world‑class engineers, and see the tangible impact of your work as we scale to serve Fortune 500 customers. As a founding member of the Pune SRE team,…