Senior Site Reliability Engineer Job San Francisco area, California USA, IT/Tech

About the Job

You’ll own reliability and operational excellence for Pylon's production systems. This means designing and implementing monitoring, alerting, and incident response processes that scale as we grow. You'll build tooling that makes the entire engineering team more effective, establish on-call rotations and runbooks, and ensure our platform can handle the demands of a regulated, high-stakes financial product.

This is not a pure ops role. At Pylon, we believe SRE work should be a maximum of 50% operational toil. If you're spending more than half your time firefighting and keeping things running, you're not doing SRE work; you're doing sysadmin work. The other 50%+ of your time should be spent writing code: building infrastructure tooling, automating away operational burden, making reliability improvements to core services, and creating internal developer productivity tools that make the entire team more effective.

SRE is about making things better, not just keeping them alive.

We're looking for someone who has operated production systems at scale in a professional engineering environment. You know what good looks like because you've built it before.

What We’re Looking For

Must-haves:

4+ years of experience in SRE, infrastructure, or platform engineering roles
Experience working on a team of SREs at a company with mature SRE practices (not solo SRE roles)
Real on-call experience at scale in a large production environment (you’ve carried the pager and lived through incidents)
Deep AWS expertise (ECS, RDS, networking, security)
Strong experience with declarative infrastructure (Terraform, CDK, or similar)
Nix experience (we use it and want to expand its adoption)
Track record of building reliability tooling and automation
Can design and implement monitoring, alerting, and observability systems from first principles
Comfortable working in a regulated environment where “breaking things” is not an option

Nice-to-haves:

Experience at companies with strong SRE cultures (Google, Replit, Stripe, etc.)
Background in fintech, healthtech, or other regulated domains
Experience migrating monitoring systems or implementing SLOs
Contributions to infrastructure tooling or open source projects

Our technology stack:

We don’t require that you’ve worked with any of these technologies before; this is just our stack, for your information:

Infrastructure: AWS (ECS, RDS, CloudFront, Lambda), CDK for infrastructure-as-code
Observability: Honeycomb, OpenTelemetry
CI/CD: GitHub Actions, Nix for builds and dev environments
Core platform: TypeScript/Node backend, PostgreSQL, React frontend
Languages: TypeScript, Python, Nix, SQL

About you

You:

Have operated production systems at scale. You’ve been on-call for a large, complex system. You know what 3am pages feel like and you've built systems to prevent them. You understand the difference between alerts that matter and noise.

Write code, not just YAML. You can build internal tools, automation, and reliability improvements. You're comfortable contributing to the core product when reliability requires it. You can read and understand the codebase you're responsible for keeping up.

Think in systems. You understand distributed systems, failure modes, cascading failures, and graceful degradation. You can diagnose production issues quickly and know when to escalate vs. when to fix.

Know your tools deeply. You've used observability platforms at scale and understand how to instrument systems properly. You can design alerting that has high signal and low noise. You know AWS inside and out.

Have strong opinions that you're willing to defend. We have a culture of vigorous discussion and debate on technical decisions. We'll push you to defend your choices, and we want you to push back.

Don’t settle. Challenge yourself to frequently and consistently deliver exceptional work. If something could be more reliable, take the initiative to improve it.

Have great ideas, and lots of them. You should see opportunities all around you to make the infrastructure, tooling and processes better. We'll give you an environment where you can act on those ideas.

Are self-motivated. You can take a goal and drive towards it without…