Member of Technical Staff, Site Reliability Engineer
Listed on 2026-06-07
IT/Tech
IT Support, Systems Engineer
Voice AI that resolves, not transfers.
Most phone systems trap callers in menus and scripts. Vapi is the platform for deploying voice agents that know your business and can listen, adapt, and resolve in minutes.
The numbers: 1 billion calls. 1 million developers. 10x enterprise ARR growth
The customers: Amazon Ring, ServiceTitan, New York Life, Intuit, Kavak, and thousands more, from YC startups to the Fortune 500
The news: a $50M Series B led by Peak XV Partners, with Bessemer Venture Partners, Kleiner Perkins, M12 (Microsoft's Venture Fund), Y Combinator, and our earlier backers. Total raised: $72M
99.99% call completion is the number this role drives. Vapi runs live phone calls — a p99 spike means callers drop. We’ve had 15 stability-gap outages worth learning from, and we need someone who runs incident command, owns SLOs and error budgets, and builds the reliability culture from scratch.
This is not a bash-and-YAML role. You’ll ship code (Go or TypeScript) for services that monitor and manage the platform: auto-remediation, capacity forecasters, oncall tooling. Capacity planning, load testing, and KEDA-based autoscaling for Vapi’s wscaler and worker pool-cron-scaler are on your plate.
30 Day:
Join the oncall rotation. Walk the 15 stability-gap incidents and turn the patterns into a prioritized reliability backlog. Define the first set of SLOs for the call-completion path.
60 Day:
Stand up error budgets and SLO-based alerting in Chronosphere/Prometheus for the highest-impact services. Run the first proper load test against provider rate limits and per-org concurrency. Tune autoscaling for wscaler / worker pool-cron-scaler.
90 Day:
Ship a real platform service (capacity forecaster, auto-remediation, or oncall tooling) in Go or TypeScript. Own the postmortem process. Drive a measurable improvement in p99 call completion or MTTR.
Must-haves
You’ve run incident command and postmortem discipline at scale on a real oncall rotation.
You’ve operated SLOs and error budgets in Chronosphere, Prometheus, Grafana, or Datadog.
You’ve done capacity planning and load testing for production systems with real users.
You’re fluent in Kubernetes production ops: pod crash diagnosis, HPA/VPA tuning, PodDisruptionBudgets, graceful shutdown.
You know backpressure and autoscaling patterns: KEDA, custom metrics scaling.
Nice-to-haves
You ship code, not just scripts. You can build platform services in Go or TypeScript (matches Vapi’s cluster-manager, database-health, wscaler, incidentManager).
Real-time / latency-sensitive product background where degraded means a dropped call, not a slow dashboard.
Tech stack you’ll work in
Languages: Go and TypeScript (you ship code, not just scripts), Bash.
Observability: Chronosphere, Prometheus, Grafana, Datadog, OpenTelemetry.
Orchestration: Kubernetes on EKS, including production ops (HPA/VPA tuning, PodDisruptionBudgets, graceful shutdown, pod crash diagnosis).
Autoscaling and backpressure: KEDA, custom metrics scaling (matches Vapi’s wscaler and worker pool-cron-scaler).
Load testing: script-based load testing, provider rate-limit auditing, per-org concurrency auditing.
Vapi services you’ll touch or build: cluster-manager, database-health, wscaler, incidentManager.
Where you likely come from
A real-time / latency-sensitive product (Discord, Zoom, Mux, Twitch, Twilio, LiveKit, Cloudflare, a trading firm, a gaming backend), or a FAANG SRE / Production Engineer (Google, Uber, Twitter/X, Meta) who misses being hands‑on.
Weak fit: SRE from analytics or CRM backends where “degraded” means a slow dashboard, not a dropped call. Anyone uncomfortable reading or writing code.
Generational impact: Build the human interface for every business
Ownership culture: 70% of the company are previous founders
Kind team: The founders, Jordan and Nikhil, are Canadians
Tier-1 Investors: YC, KP seed, Bessemer Series A
Real stake: We offer a competitive salary and excellent equity ownership
Comprehensive health coverage: medical, dental, and vision plans
Team love: We love hanging out, and we do quarterly off‑sites
Flexible time off: take what you need
More: catered meals, transportation, gym, and a $10k annual L&D budget