
Senior Site Reliability Engineer

Job in 400001, Mumbai, Maharashtra, India
Listing for: MyOperator
Full Time position
Listed on 2026-05-15
Job specializations:
  • IT/Tech
    Systems Engineer, SRE/Site Reliability, Cloud Computing, IT Support
Job Description & How to Apply Below
Role Overview

We are looking for a skilled and proactive Site Reliability Engineer (SRE) to take end-to-end ownership of production reliability, observability, and performance engineering across MyOperator’s AI-powered communication infrastructure.

This role is not operational-only: it requires strong system design thinking, deep troubleshooting ability, and a production ownership mindset. You will define reliability standards, build observability frameworks, lead incident response, and drive SLO-based engineering practices across distributed AWS and Kubernetes environments.

About MyOperator

MyOperator is a Business AI Operator platform that enables businesses, teams, and AI agents to work together seamlessly for customer operations such as Sales, Support, Escalations, Feedback, and Refund processes. With 12,000+ businesses using our platform, we operate at meaningful scale and power mission-critical communication workflows including voice bots, WhatsApp automation, and intelligent call routing. We are building for reliability, speed, and impact.

MyOperator values ownership, critical thinking, and execution. This is a high-expectation, high-learning environment where engineers are empowered to solve complex problems and build systems that directly affect customer outcomes.

Key Responsibilities

- Own production reliability, uptime, latency, and error budgets across critical services.
- Design and manage production-grade monitoring using Grafana, VictoriaMetrics (Prometheus), and AWS CloudWatch.
- Define and enforce SLIs, SLOs, and SLA thresholds for AI communication systems (voice bots, WhatsApp APIs, call routing).
- Build real-time operational dashboards for incident response, capacity planning, and leadership visibility.
- Implement end-to-end distributed tracing using OpenTelemetry (OTEL Collector).
- Design and maintain centralized logging with strong correlation between logs, metrics, and traces.
- Create SLO-based alerting systems with minimal noise and fast incident detection.
- Lead incident response lifecycle: alert triage, mitigation, RCA documentation, and preventive improvements.
- Drive MTTR reduction through structured monitoring, automation, and reliability engineering practices.
- Monitor and troubleshoot AWS EKS (Kubernetes) production workloads.
- Instrument and monitor LLM API integrations, AI inference pipelines, and messaging systems.
- Analyze logs using OpenSearch / ELK for anomaly detection and root cause identification.
- Automate operational workflows using Python or Bash to eliminate manual toil.
- Drive performance optimization, scalability improvements, and capacity planning.
- Collaborate with engineering teams to instrument new services from day one.

Required Skills & Qualifications

- 3–6 years of experience in Site Reliability Engineering, DevOps, or Platform Engineering roles.
- Hands-on experience with VictoriaMetrics / Prometheus (time-series monitoring), Grafana dashboards and visualization, and PromQL for writing complex queries and alerts.
- Experience implementing distributed tracing using OpenTelemetry (mandatory).
- Strong experience with centralized logging systems (ELK / OpenSearch / Loki).
- Experience with alerting frameworks such as Alertmanager or Grafana Alerts.
- Strong understanding of SLIs, SLOs, SLA design, and reliability engineering principles.
- Hands-on experience managing AWS production workloads (EC2, RDS, ELB, CloudWatch, IAM).
- Experience with Kubernetes (AWS EKS preferred).
- Good understanding of Linux systems, networking, and cloud infrastructure.
- Experience handling production incidents and participating in on-call rotations.
- Ability to automate operational tasks using Python or Bash.

Good to Have

- Experience with OpenSearch / ELK log pipelines and anomaly detection.
- Kubernetes monitoring (pod health, node metrics, autoscaling behavior).
- CI/CD observability integration (Jenkins, GitHub Actions).
- Experience monitoring LLM APIs and AI inference pipelines.
- Familiarity with MLOps or AI observability tools (Arize, WhyLabs, etc.).
- Service mesh exposure (Istio).
- Infrastructure as Code (Terraform, CloudFormation).
- Experience with chaos engineering or load testing tools.
- Multi-cluster or multi-region architecture exposure.

Key Expectations

- Ownership of production systems and high availability.
- Strong troubleshooting and debugging skills.
- Focus on automation and reliability improvements.
- Proactive approach to incident prevention.
- Ability to reduce alert noise and improve signal quality.
- Data-driven approach to reliability engineering.

This Role Is Not For

- Candidates with purely development experience and no production ownership.
- Candidates without real incident response or on-call experience.
- Freshers or candidates with less than 3 years of experience.
Position Requirements
10+ Years work experience