Site Reliability Engineer Job Gurgaon area, Uttar Pradesh India, IT/Tech

About the Job
The Site Reliability Engineering (SRE) team is responsible for ensuring the reliability, scalability, and performance of large-scale telecom and CPaaS platforms. This role combines software engineering and systems operations to build resilient, observable, and automated infrastructure that supports high-throughput messaging services. The team operates in a 24/7 environment and works closely with Engineering, CX and Products to maintain carrier-grade service reliability.

What you’ll be responsible for
Ensure high availability, performance, and reliability of CPaaS production systems spread across multiple locations, hosted in the cloud and in data centers.
Own and improve SLIs, SLOs, and SLAs for messaging platforms and supporting services.
Monitor system health, latency, TPS, error rates, and delivery metrics using observability tools.
Participate in on-call rotations and handle production incidents with a focus on fast recovery and root cause analysis.
Deploy, configure, and optimize services for high-throughput messaging across multiple channels.
Troubleshoot telecom-specific issues including DLR failures, encoding problems, TPS drops, and routing issues.
Work directly with multiple teams for integrations, testing, and incident resolution.
Perform packet-level analysis using tcpdump and Wireshark to diagnose network and protocol-level issues.
Write and maintain shell scripts and automation to eliminate repetitive operational tasks and reduce human intervention.
Contribute to infrastructure automation using tools like Ansible and CI/CD pipelines where applicable.
Improve deployment, configuration, and rollback processes for messaging services.
Design and enhance monitoring, alerting, and dashboards using tools such as Datadog, Site24x7, ELK and Grafana.
Administer and troubleshoot Linux-based servers in production environments.
Manage and optimize MySQL and MongoDB databases, including performance tuning, backups, and recovery.
Work on APIs and webhooks across products and services, including their enhancement and troubleshooting.
Maintain web and application servers such as Apache, Nginx, and JBoss (WildFly).
Support cloud-based and virtualized environments with exposure to auto-scaling and containerization concepts.
Collaborate with engineering teams on release planning, production deployments, and post-release validation.
Lead or contribute to incident response and RCA, focusing on long-term reliability improvements.
Track issues, changes, and reliability work using Jira and related tools.

What you’d have
B.Tech / B.E in Computer Science or related field with 2–3 years of experience in SRE, DevOps, telecom, or CPaaS operations.
Hands-on experience with SMS gateways and messaging workflows.
Solid understanding of Linux systems, networking fundamentals, and production troubleshooting.
Strong experience with MySQL and MongoDB administration, queries, and performance optimization.
Proficiency in shell scripting and a mindset toward automation and reliability engineering.
Hands-on experience with tcpdump, Wireshark, and protocol-level troubleshooting.

Experience with monitoring, logging, and alerting systems (Datadog, ELK, Grafana, Site24x7, etc.).
Familiarity with configuration management tools like Ansible and version control systems (Git).
Working knowledge of cloud platforms, virtualization, auto-scaling, and containerization.
Strong incident management, analytical thinking, and communication skills.
Certifications such as RHCE, AWS, or SRE-related credentials are a plus.

