Senior Site Reliability Engineer
Listed on 2025-12-22
IT/Tech
Systems Engineer, SRE/Site Reliability, Cloud Computing
- Senior Site Reliability Engineer with deep expertise in optimizing system reliability, performance, and scalability across cloud environments (Azure, Kubernetes, Service Mesh).
- Proficient in defining, measuring, and improving Service Level Objectives (SLOs), managing error budgets, and automating toil to drive operational excellence in a blameless culture.
- Remote-first opportunity for US-based employees with the option to work in-person out of our Manhattan office.
Join Zip’s Engineering function and put your name to solving fascinating challenges at scale in an agile, test-driven development environment. If you value good domain-driven design and enjoy delivering quality work at pace, you’ll be a great fit with the squads responsible for building cloud-native software applications that serve millions of customers and process billions of dollars in payments.
We are seeking a seasoned leader with extensive senior leadership experience to spearhead our Site Reliability Engineering (SRE) initiatives and mentor our engineering team. This role requires a deep understanding of operational excellence, managing production risk, and the ability to lead reliability initiatives from inception to completion. Collaboration is key in our environment, so we need someone who excels in a team-oriented setting.
As we aim to double our footprint this year, you will encounter complex challenges that demand innovative solutions and strategic insight to maintain and improve system reliability. If you are passionate about driving infrastructure excellence and nurturing talent within a dynamic SRE team, we would love to hear from you.
- Work within an infrastructure that is capable of handling billions of dollars in transactions quickly and securely
- Collaborate with engineering teams to design and deploy highly reliable and scalable integrated solutions for Fortune 100 companies.
- Develop automated upgrade systems for a constantly evolving Azure architecture
- Maintain a complex event sourcing environment using CQRS principles
- Develop self-service tooling and automation (e.g., using Terraform, Atlantis, ArgoCD) to empower development teams to operate services within established reliability standards and reduce toil.
- Monitor for service health and create automatic recoveries using metrics-based canaries to ensure reliable code deployment
- 10+ years of experience in a Site Reliability Engineering, Production Engineering, or equivalent role.
- 5+ years of experience working with Kubernetes or similar microservice architecture.
- 5+ years of experience working in an Azure environment
- Proven experience defining and implementing Service Level Indicators (SLIs) and Service Level Objectives (SLOs) and managing error budgets.
- Experience working in an agile environment and knowledge of agile practices
- Jira experience with project management and story creation is a plus
- Experience with CI/CD systems, preferably using Azure DevOps or GitHub Actions
- Strong understanding of networking and routing protocols, especially those involved in Service Mesh architectures
- Experience incorporating AI tools such as ChatGPT, Cursor, Codex, or GitHub Copilot into your day-to-day work.
- Must be able to work in an on-call rotation with a focus on sustainable incident response and post-mortem analysis (blameless culture).
Zip is a place where you’ll get out what you put in. The newness of our sector means we need to move at pace and embrace change, and our promise to you when you join the team is that you’ll feel empowered and trusted to make big things happen quickly.
We want you to feel welcome and as though you have the support to be yourself, and care for yourself, because it’s important to us that you make the most of the opportunities you’ll get to grow your skills and your career, and be surrounded by smart, friendly people and leaders that have your back.
We think these are just some of the best things about being a Zipster. We will also offer you:
- Flexible working culture
- Incentive programs
- 20 days PTO every year
- Generous paid parental leave
- Leading family…