Software Engineer, Reliability Job Los Angeles area, California USA, IT/Tech

Position: Staff Software Engineer, Reliability

Metropolis is an artificial intelligence company that uses computer vision technology to enable frictionless, checkout‑free experiences in the real world. Today, we are reimagining parking to enable millions of consumers to just "drive in and drive out." We envision a future where people transact in the real world with a speed, ease and convenience that is unparalleled, even online. Tomorrow, we will power checkout‑free experiences anywhere you go to make everyday living, working and playing remarkable – giving us back our most valuable asset, time.

Who you are

We are building a hyperscaler company and need someone to own reliability across the entire Metropolis platform. As a Staff or Senior Software Engineer focused on Reliability, you will establish and drive comprehensive reliability practices that ensure system availability, resilience, and observability for our mission‑critical mobility infrastructure serving millions of transactions. This is your opportunity to build reliability from first principles—architecting fail‑over systems, implementing chaos engineering practices, and improving the observability foundation that will enable Metropolis to scale to new markets while maintaining 99.9%+ uptime.

You will be the technical owner of our reliability posture, working on everything from multi‑region fail‑over architectures to incident response workflows to SLO‑based alerting strategies.

What you'll do

Own the overall reliability posture for the Metropolis platform, establishing practices, metrics, and systems that ensure 99.9%+ uptime across all services
Design and implement automatic fail‑over mechanisms for critical external dependencies (Twilio for SMS/voice, Stripe for payments) with circuit breakers, retry policies, and degraded mode operations
Architect and build active‑passive or active‑active regional deployment strategies with database replication, automated fail‑over, and DNS‑based traffic routing, including disaster recovery planning and testing
Establish comprehensive monitoring using Datadog for APM, logs, and metrics correlation; implement synthetic monitoring, SLO‑based alerting, on‑call rotation, and escalation policies; build service health dashboards that show customer impact
Own the incident management process including workflows, tooling, post‑mortem culture, runbook automation, and MTTR reduction initiatives that drive down mean time to recovery from detection to resolution
Drive adoption of resilience patterns across all services including health checks, graceful degradation, feature flags, rate limiting, back pressure mechanisms, and chaos engineering practices
Build and maintain local mirrors for critical dependencies (Maven/NPM/Docker registries) with artifact caching, dependency pinning, and vulnerability scanning to prevent build failures from upstream outages

What we’re looking for

8+ years of backend software engineering experience with deep focus on distributed systems and platform infrastructure
Expert‑level Java proficiency with deep understanding of JVM performance, concurrency, and ecosystem tooling. Scala experience is a big plus
Production experience with microservices architecture, container orchestration (Kubernetes), and cloud platforms (AWS)
Strong systems thinking with proven ability to design and implement large‑scale, high‑availability distributed systems that handle significant load
Observability expertise including hands‑on production experience with metrics, logging, tracing, and alerting systems in high‑load environments
Database and data systems knowledge including relational databases, event streaming (Kafka, SQS), caching strategies, and data consistency patterns
Experience with AI‑powered development tools such as Claude Code, GitHub Copilot, or similar agentic coding tools for enhanced productivity – context engineering in particular
Excellent technical communication with ability to design and document complex systems, lead technical discussions, and collaborate across multiple teams local to the New York City, Seattle, or Los Angeles area

While not required, these are a plus

SRE or Reliability Engineering experience at companies known for operational…

