
Software Engineer, Reliability

Job in Seattle, King County, Washington, 98127, USA
Listing for: The Rundown AI, Inc.
Full Time position
Listed on 2025-12-20
Job specializations:
  • IT/Tech
    Systems Engineer, SRE/Site Reliability, Cloud Computing, IT Support
Salary/Wage Range or Industry Benchmark: 150000 - 200000 USD Yearly
Job Description & How to Apply Below
Position: Staff Software Engineer, Reliability

Who we are

Metropolis is an artificial intelligence company that uses computer vision technology to enable frictionless, checkout-free experiences in the real world. Today, we are reimagining parking to enable millions of consumers to just "drive in and drive out." We envision a future where people transact in the real world with a speed, ease and convenience that is unparalleled, even online. Tomorrow, we will power checkout-free experiences anywhere you go to make the everyday experiences of living, working and playing remarkable - giving us back our most valuable asset, time.

Who you are

We are building a hyperscaler company and need someone to own reliability across the entire Metropolis platform. As a Staff or Senior Software Engineer focused on Reliability, you'll establish and drive the comprehensive reliability practices that ensure system availability, resilience, and observability for our mission-critical mobility infrastructure serving millions of transactions.

This is your opportunity to build reliability from first principles – architecting failover systems, implementing chaos engineering practices, and improving the observability foundation that will enable Metropolis to scale to new markets while maintaining 99.9%+ uptime. You'll be the technical owner of our reliability posture, working on everything from multi-region failover architectures to incident response workflows to SLO-based alerting strategies.

Our platform handles real-time payment processing, customer authentication, and parking facility operations – systems that cannot go down. You'll tackle challenges like external service failover, dependency mirroring to prevent upstream outages, database replication and automatic promotion, and building the monitoring and alerting infrastructure that ensures we detect and respond to issues in minutes, not hours.

If you're energized by the challenge of ensuring system reliability at scale, building robust failover mechanisms, implementing comprehensive observability, and establishing the practices that prevent incidents before they occur, this role is for you. You'll work alongside highly technical teams across the organization, influencing architecture decisions and establishing reliability standards that affect every service we build.

What you'll do
  • Own the overall reliability posture for the Metropolis platform, establishing practices, metrics, and systems that ensure 99.9%+ uptime across all services
  • Design and implement automatic failover mechanisms for critical external dependencies (Twilio for SMS/voice, Stripe for payments) with circuit breakers, retry policies, and degraded mode operations
  • Architect and build active-passive or active-active regional deployment strategies with database replication, automated failover, and DNS-based traffic routing including disaster recovery planning and testing
  • Establish comprehensive monitoring using Datadog for APM, logs, and metrics correlation; implement synthetic monitoring, SLO-based alerting, on-call rotation, and escalation policies; build service health dashboards that show customer impact
  • Own the incident management process, including workflows, tooling, post-mortem culture, and runbook automation, driving down mean time to recovery (MTTR) from detection to resolution
  • Drive adoption of resilience patterns across all services including health checks, graceful degradation, feature flags, rate limiting, back pressure mechanisms, and chaos engineering practices
  • Build and maintain local mirrors for critical dependencies (Maven/NPM/Docker registries) with artifact caching, dependency pinning, and vulnerability scanning to prevent build failures from upstream outages
What we're looking for
  • 8+ years of backend software engineering experience with deep focus on distributed systems and platform infrastructure
  • Expert-level Java proficiency with deep understanding of JVM performance, concurrency, and ecosystem tooling. Scala experience is a big plus
  • Production experience with microservices architecture, container orchestration (Kubernetes), and cloud platforms (AWS)
  • Strong systems thinking with proven ability to design and…