Senior SRE Job, Palo Alto area, California, USA, IT/Tech

At Pylon, we're a small team building a very ambitious product in the mortgage space.

At this early stage, we're looking for engineers who can see the opportunity of what we're building towards and want to have a hand in building it.

We're in search of people who find difficult problems invigorating and who fit well into a high-performing team built on mutual respect and reliance. If you like pushing yourself to learn a massive amount while shipping code that has a huge impact on the end product, Pylon Engineering could be a great place for you.

About the Job

You'll own reliability and operational excellence for Pylon's production systems. This means designing and implementing monitoring, alerting, and incident response processes that scale as we grow. You'll build tooling that makes the entire engineering team more effective, establish on-call rotations and runbooks, and ensure our platform can handle the demands of a regulated, high-stakes financial product.

This is not a pure ops role. At Pylon, we believe SRE work should be at most 50% operational toil. If you're spending more than half your time firefighting and keeping things running, you're not doing SRE work; you're doing sysadmin work. The other 50%+ of your time should be spent writing code: building infrastructure tooling, automating away operational burden, making reliability improvements to core services, and creating internal developer productivity tools that make the entire team more effective.

SRE is about making things better, not just keeping them alive.

We're looking for someone who has operated production systems at scale in a professional engineering environment. You know what good looks like because you've built it before.

What We’re Looking For

Must-haves:

4+ years of experience in SRE, infrastructure, or platform engineering roles
Experience working on a team of SREs at a company with mature SRE practices (not solo SRE roles)
Real on-call experience at scale in a large production environment (you've carried the pager and lived through incidents)
Deep AWS expertise (ECS, RDS, networking, security)
Strong experience with declarative infrastructure (Terraform, CDK, or similar)
Nix experience (we use it and want to expand its adoption)
Track record of building reliability tooling and automation
Can design and implement monitoring, alerting, and observability systems from first principles
Comfortable working in a regulated environment where "breaking things" is not an option

Nice-to-haves:

Experience at companies with strong SRE cultures (Google, Replit, Stripe, etc.)
Background in fintech, healthtech, or other regulated domains
Experience migrating monitoring systems or implementing SLOs
Contributions to infrastructure tooling or open source projects

Basics

Job title: Senior Site Reliability Engineer
Stock options: own a piece of the company and we all win together
Health insurance, 401(k), dental, etc.

Our technology stack:

We don't require that you've worked with any of these technologies before; this is just our stack for your information:

Infrastructure: AWS (ECS, RDS, CloudFront, Lambda), CDK for infrastructure-as-code
Observability: Honeycomb, OpenTelemetry
CI/CD: GitHub Actions, Nix for builds and dev environments
Core platform: TypeScript/Node backend, PostgreSQL, React frontend
Languages: TypeScript, Python, Nix, SQL

About you

You:

Have operated production systems at scale. You've been on-call for a large, complex system. You know what 3am pages feel like and you've built systems to prevent them. You understand the difference between alerts that matter and noise.

Write code, not just YAML. You can build internal tools, automation, and reliability improvements. You're comfortable contributing to the core product when reliability requires it. You can read and understand the codebase you're responsible for keeping up.

Think in systems. You understand distributed systems, failure modes, cascading failures, and graceful degradation. You can diagnose production issues quickly and know when to escalate vs. when to fix.

Know your tools deeply. You've used observability platforms at scale and understand how to instrument systems properly. You can design…

