Senior Site Reliability Engineer Job, Gurugram area, Uttar Pradesh, India, IT/Tech

Senior Site Reliability Engineer (SRE)
Summary
We are looking for a Senior Site Reliability Engineer (SRE) to build and operate scalable, reliable, and secure platform infrastructure. The ideal candidate will drive automation, observability, incident management, and cloud-native best practices to improve system reliability and operational excellence across distributed systems.

Roles & Responsibilities
Define and manage SLIs, SLOs, and error budgets for critical services
Design and enhance monitoring, logging, alerting, and tracing capabilities
Automate operational processes and improve platform efficiency
Participate in incident response, root cause analysis (RCA), and postmortem reviews
Support production environments through on-call rotations and reliability initiatives
Improve system performance, scalability, availability, and capacity planning
Collaborate with engineering teams to enhance application resiliency and operational readiness
Drive adoption of Infrastructure as Code (IaC) and CI/CD best practices
Maintain highly available, fault-tolerant, and secure cloud infrastructure

Skills
Strong Linux/Unix administration and debugging skills
Proficiency in Python/Bash/Shell scripting and automation
Expertise in observability and monitoring tools such as Grafana, Prometheus, ELK, and New Relic
Strong expertise in AWS and cloud infrastructure management
Strong experience with log analysis and monitoring using ELK
Strong incident management, communication, and operational excellence mindset
Hands-on experience with Kubernetes, Docker, and container orchestration
Experience with Terraform and Infrastructure as Code practices
Strong understanding of networking, DNS, load balancing, and distributed systems
Experience with CI/CD tools such as Jenkins, GitHub Actions, GitLab CI, or ArgoCD

Qualifications
B.Tech/B.E. or equivalent
4+ years of experience in SRE, DevOps, Platform Engineering, or Systems Engineering

Good to Have
Bachelor's degree in Computer Science, Engineering, or a related field
Cloud or Kubernetes certifications
Experience managing production incidents in high-availability environments
Exposure to multi-cloud architectures (AWS/GCP/Azure)