Site Reliability Engineer Job, Hyderabad area, Telangana, India, IT/Tech

Skills Required:

SQL, NoSQL, Nagios, CloudWatch, Zabbix, Datadog, New Relic, Prometheus, Grafana,
AppDynamics, Site24x7, Telemetry, Splunk, CI/CD, DevOps, Kentico,
SRE, Site Reliability, AIOps, Agentic, GenAI, AI, ML

Experience Range:
10 - 16 years

Key Responsibilities:

• Design, develop and maintain observability, monitoring, and alerting systems for AI
platforms and mission-critical backend services.

• Design telemetry pipelines, logging infrastructure, and metrics dashboards using tools
such as Splunk, Prometheus, Grafana, and OpenTelemetry.

• Define and maintain SLOs, SLIs, and real-time health indicators across platform
services and APIs.

• Participate in on-call rotations and lead the resolution of high-impact incidents,
including root cause analysis and postmortem reporting.

• Collaborate with platform engineering teams to enforce governance, compliance, and
security standards in production environments.

• Enhance deployment pipelines, CI/CD workflows, and infrastructure automation (e.g.,
GitLab).

• Optimize and scale infrastructure components such as Kafka, HAProxy, RMQ,
databases, and distributed APIs.

• Support capacity planning, cost analysis, and system tuning to improve platform
performance.

• Advocate for automation-first operations, reducing manual toil through scripting and
reliability tooling.

• Create and maintain documentation, runbooks, and knowledge-sharing resources
across SRE and engineering teams.

• Mentor junior engineers and foster a culture of technical rigor and continuous
improvement.

Qualifications:

• Bachelor’s degree in Computer Science, Engineering, or a related field (Master’s
preferred).

• 10+ years of experience in SRE, DevOps, or operations engineering in cloud-based
environments.

• Hands-on experience with monitoring, alerting, and incident response in distributed
systems.

• Strong coding and scripting skills in Python, Java, or shell scripting languages such as
Bash or PowerShell.

• Solid understanding of database principles and experience with distributed data
systems such as Oracle, Cassandra, Solr, and Kafka.

• Proficiency in CI/CD pipelines and GitLab workflows.

• Strong working knowledge of SQL and NoSQL databases, including Oracle and
Cassandra.

• Expertise in Linux, networking concepts (TLS/SSL, DNS, load balancers), and
troubleshooting large-scale environments.

• Familiarity with AI/ML systems, APIs, and modern LLM tooling is a strong plus.

• Expertise in observability tools such as Splunk, Grafana, and Prometheus.

• Experience with Kubernetes, container orchestration, and hybrid/multi-cloud
deployments (Azure preferred; AWS/GCP/OCI acceptable).

• Deep understanding of security concepts and protocols, including authentication,
authorization, encryption, SSL/TLS, SSH/SFTP, PKI, X.509 certificates, and PGP.

• Excellent knowledge of ITIL/ServiceNow terminology for incident and problem
management.

• Proven ability to work in fast-paced, incident-driven environments with high uptime
requirements.

Preferred Qualifications:

• Experience supporting AI workloads, model inference systems, or LLM-enabled
platforms.

• Exposure to AIOps or related ML platform observability and reliability practices.

• Familiarity with LangChain, OpenAI, Spring AI, and MCP Server is a strong plus.

• Experience in highly regulated telecom environments with compliance and audit
controls.

• Understanding of AI Gateway patterns and secure API orchestration.

• Background in building secure, zero-downtime platforms with enterprise-scale SLAs.

Knowledge, Skills, and Abilities:

• Strong grasp of SRE best practices, including SLOs, SLIs, postmortems, and chaos
engineering.

• Ability to diagnose system bottlenecks across infrastructure, application, and network
layers.

• Expertise in driving automation across observability, configuration, and deployment
domains.

• Excellent communication and collaboration skills in cross-functional technical teams.

• Curiosity-driven mindset with a passion for learning emerging AI technologies and
improving system reliability.

• Strong commitment to automating processes for proactive monitoring, anomaly
detection, and alerting.

