Site Reliability Engineer, New York, New York, USA (IT/Tech)

Location: New York

Berkley Hunt has partnered with a high-growth fintech company to hire a Site Reliability Engineer to help build, operate, and scale a globally distributed, highly available cloud platform. This role focuses on reliability, automation, and operational excellence, working closely with engineering teams to ensure systems are resilient, scalable, and production-ready from day one.

Hybrid in Manhattan

Who You Are:

You think in systems, not silos; you naturally connect infrastructure decisions to customer experience and business impact.
You have strong experience running production environments at scale and understand what “good” looks like in terms of uptime, latency, and reliability.
You’re confident operating Kubernetes in real-world production settings, not just deploying to it.
You have a solid background in cloud architecture across AWS and GCP, and understand the trade-offs of distributed systems.
You are proactive about identifying risk and eliminating single points of failure before they become incidents.
You are comfortable working in fast-paced environments where priorities evolve and ownership is shared.
You believe infrastructure should be repeatable, observable, and continuously improving.

Responsibilities:

Architect and evolve cloud infrastructure to support a secure, highly available, and globally distributed fintech platform.
Embed reliability best practices into the development lifecycle, influencing design decisions before code reaches production.
Drive improvements in deployment workflows through GitOps and Infrastructure-as-Code methodologies.
Enhance system visibility by building robust monitoring, logging, and alerting frameworks.
Lead incident response efforts, conduct post-incident reviews, and implement preventative measures to strengthen platform resilience.
Continuously refine Kubernetes environments to improve performance, scalability, and operational efficiency.
Partner cross-functionally with engineering and product teams to balance speed of delivery with operational stability.
Reduce operational toil by identifying automation opportunities and improving internal tooling.
