Senior Site Reliability Engineer, New York, New York, USA (IT/Tech)

Location: New York City or Chicago (Hybrid)

A technology-driven investment firm is expanding its Platform Engineering organization and is seeking an experienced Senior Site Reliability Engineer to help shape reliability practices across its infrastructure and production environments.

This role offers the opportunity to build and scale SRE practices from the ground up, partnering closely with platform, DevOps, and cloud engineering teams to drive reliability, performance, and operational maturity across a complex technology ecosystem. You will work across both cloud and on-premises environments, supporting highly critical production systems including trading and data platforms. The role combines hands-on engineering with strategic influence, helping define reliability standards and operational frameworks across the organization.

What You’ll Do

Help establish and evolve Site Reliability Engineering practices, standards, and operational processes across engineering teams
Design and scale observability and monitoring platforms using tools such as Prometheus, Grafana, Loki, Tempo, and OpenTelemetry
Participate in a team-based on-call rotation (approximately one week per month) supporting critical production systems
Define reliability standards for applications running in Kubernetes environments, ensuring optimal configuration for performance, cost, and resiliency
Build automation and tooling to improve deployment pipelines, system health monitoring, and recovery processes
Partner with engineering teams to improve service stability, scalability, and fault tolerance
Promote SRE best practices such as service level objectives (SLOs), incident reviews, and blameless post-mortems

What You Bring

8+ years of experience in Site Reliability Engineering, Platform Engineering, or Infrastructure Engineering roles
Experience operating large-scale distributed systems in production environments
Strong expertise with observability and monitoring platforms, including Prometheus, Grafana, Loki, Tempo, and OpenTelemetry
Deep understanding of containerization and orchestration technologies, including Docker and Kubernetes
Experience working across cloud infrastructure (AWS preferred) and on-premises environments
Strong scripting and automation skills using Python, Bash, or Go
Experience building and maintaining CI/CD pipelines and modern DevOps workflows

What Makes You Stand Out

Passion for building reliable, scalable infrastructure and improving operational maturity
Ability to translate complex reliability concepts into practical engineering solutions
Strong collaboration skills when working across engineering, platform, and infrastructure teams
A mindset focused on automation, observability, and continuous improvement

Why This Role

Opportunity to define and build SRE practices from the ground up
Work on mission-critical infrastructure supporting high-performance systems
Collaborate with platform, cloud, and engineering teams building modern infrastructure at scale
High-impact role within a technology-focused financial environment
