Head of Reliability Engineering Job, Gurugram area, Uttar Pradesh, India, IT/Tech

Our client is looking for a Head of Reliability Engineering – Trading Infrastructure to lead scalable, fault-tolerant, and high-performance trading infrastructure for mission-critical real-time systems.

Key Responsibilities
Lead the reliability engineering function across trading infrastructure and production platforms.
Architect and operate highly available, fault-tolerant distributed systems supporting live trading environments.
Own infrastructure reliability, observability, scalability, deployment safety, and operational excellence across mission-critical systems.
Drive platform engineering initiatives across Kubernetes, CI/CD, infrastructure automation, runtime orchestration, and developer tooling.
Partner closely with trading, quant, and backend engineering teams to optimize latency, throughput, resiliency, and production stability.
Build and standardize monitoring, alerting, tracing, logging, failover testing, disaster recovery, and incident response frameworks.
Lead root cause analysis and resolution for complex production and distributed systems issues.
Strengthen infrastructure security, auditability, secrets management, and operational governance across trading environments.
Improve engineering productivity through automation, internal tooling, and infrastructure self-service capabilities.
Define operational best practices, reliability standards, release governance, and infrastructure lifecycle management processes.
Mentor and help scale the future reliability and platform engineering organization.

Required Experience
7–12 years of experience in Infrastructure Engineering, Reliability Engineering, SRE, Platform Engineering, or Distributed Systems environments.
Strong experience operating mission-critical production systems in high-availability environments.
Deep expertise in Linux systems, networking, and distributed infrastructure architecture.
Strong hands-on experience with Kubernetes and containerized production environments.
Strong programming ability in Go or Python.
Experience with Kafka, Terraform, Vault, Consul, CI/CD pipelines, and infrastructure automation frameworks.
Strong understanding of observability platforms, including Prometheus, Alertmanager, logging, and tracing systems.
Proven expertise debugging complex distributed systems and low-latency production environments.
Experience in trading systems, fintech, exchanges, HFT firms, or other real-time infrastructure environments is highly preferred.
Strong ownership mindset with the ability to operate in high-performance engineering environments.