Production Reliability Engineer
Listed on 2026-02-12
IT/Tech
Systems Engineer, SRE/Site Reliability, IT Support, Cloud Computing
A top global trading firm is seeking a Production Reliability Engineer to join its Central Operations and Reliability Engineering team within Production Infrastructure. This group plays a critical role in managing and supporting a real-time, high-performance trading environment operating at global scale. The role sits at the intersection of infrastructure, automation, and live production support, with direct responsibility for system reliability, performance, and operational risk in a mission-critical environment.
What You’ll Do
- Own and improve a large-scale production environment with a focus on reliability, performance, and operability
- Proactively monitor and troubleshoot distributed, latency-sensitive systems
- Build and maintain DevOps and automation tooling across configuration management, deployments, monitoring, data collection, and analysis
- Use system and operational metrics to improve scalability and stability
- Partner with engineers, operators, and stakeholders to investigate and resolve complex system issues
- Coordinate production changes and manage incidents in collaboration with risk and operational support teams
- Communicate directly with end users to manage incidents and drive technology improvements
- Support reconciliation workflows related to system output and downstream processes
- Evaluate and manage operational risk for production changes
- Define, document, and continuously refine operational procedures
- Mentor and support other reliability and operations engineers
- Participate in shared operational and on-call responsibilities
What You Bring
- Degree in Computer Science, Engineering, or equivalent professional experience
- 5+ years in DevOps, SRE, Linux Systems Engineering, or Network Engineering roles
- 3+ years of experience with Python and shell scripting; familiarity with C++ is a plus
- Strong Linux expertise, including system internals, performance tuning, and system/network configuration
- Solid understanding of networking fundamentals (routing, multicast, VLANs, Ethernet)
- Detail-oriented mindset with a strong sense of ownership and urgency
- Ability to support periodic on-call duties with reliable availability
Why This Role
You’ll work on high-impact systems where reliability matters, in a highly technical environment that values ownership, precision, and continuous improvement. The role offers deep exposure to complex infrastructure and real-time problem-solving without bureaucracy or brand-driven distractions.