Senior Site Reliability Engineer
Listed on 2026-04-23
IT/Tech
Systems Engineer, SRE/Site Reliability, Cloud Computing, Network Engineer
Location: New York City or Chicago (Hybrid)
A technology-driven investment firm is expanding its Platform Engineering organization and is seeking an experienced Senior Site Reliability Engineer to help shape reliability practices across its infrastructure and production environments.
This role offers the opportunity to build and scale SRE practices from the ground up, partnering closely with platform, DevOps, and cloud engineering teams to drive reliability, performance, and operational maturity across a complex technology ecosystem. You will work across both cloud and on-premise environments, supporting highly critical production systems including trading and data platforms. The role combines hands-on engineering with strategic influence, helping define reliability standards and operational frameworks across the organization.
- Help establish and evolve Site Reliability Engineering practices, standards, and operational processes across engineering teams
- Design and scale observability and monitoring platforms using tools such as Prometheus, Grafana, Loki, Tempo, and OpenTelemetry
- Participate in a team-based on-call rotation (approximately one week per month) supporting critical production systems
- Define reliability standards for applications running in Kubernetes environments, ensuring optimal configuration for performance, cost, and resiliency
- Build automation and tooling to improve deployment pipelines, system health monitoring, and recovery processes
- Partner with engineering teams to improve service stability, scalability, and fault tolerance
- Promote SRE best practices such as service level objectives (SLOs), incident reviews, and blameless post-mortems
- 8+ years of experience in Site Reliability Engineering, Platform Engineering, or Infrastructure Engineering roles
- Experience operating large-scale distributed systems in production environments
- Strong expertise with observability and monitoring platforms, including Prometheus, Grafana, Loki, Tempo, and OpenTelemetry
- Deep understanding of containerization and orchestration technologies, including Docker and Kubernetes
- Experience working across cloud infrastructure (AWS preferred) and on-premise environments
- Strong scripting and automation skills using Python, Bash, or Go
- Experience building and maintaining CI/CD pipelines and modern DevOps workflows
- Passion for building reliable, scalable infrastructure and improving operational maturity
- Ability to translate complex reliability concepts into practical engineering solutions
- Strong collaboration skills when working across engineering, platform, and infrastructure teams
- A mindset focused on automation, observability, and continuous improvement
- Opportunity to define and build SRE practices from the ground up
- Work on mission-critical infrastructure supporting high-performance systems
- Collaborate with platform, cloud, and engineering teams building modern infrastructure at scale
- High-impact role within a technology-focused financial environment