Senior Site Reliability Engineer (SRE)
Location: Hyderabad, India (Remote)
Looking for candidates who are local to Hyderabad.
Overview
As a Senior Site Reliability Engineer, you will design and implement highly reliable, scalable, and secure systems. You will lead incident response, improve operational processes, and mentor junior engineers. This role requires strong technical expertise in cloud infrastructure, automation, observability, and reliability engineering practices.
Key Responsibilities
Define and enforce Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for critical systems, ensuring alignment with business goals.
Monitor and improve MTTR (Mean Time to Recovery) and MTTD (Mean Time to Detect) across services to enhance operational resilience.
Design scalable AWS-based systems, including multi-region deployments and disaster recovery strategies.
Develop reusable, versioned Terraform modules and maintain scalable Ansible configurations.
Design and maintain reliable CI/CD pipelines using tools like Harness.
Build and optimize observability systems using Dynatrace and other platforms for proactive monitoring and root cause analysis.
Lead incident response efforts, conduct root cause analysis, and implement process improvements.
Manage production databases (SQL/NoSQL), including replication, failover, and performance tuning.
Implement security controls and ensure compliance with organizational standards.
Conduct capacity planning and performance tuning for critical systems.
Collaborate with application support teams and manage vendor relationships to ensure timely resolution of issues and adherence to SLAs.
Mentor engineers, collaborate across teams, and influence reliability practices organization-wide.
Required Skills and Experience
Mandatory Skills:
Advanced Linux/Windows tuning and hardening.
Strong proficiency in Bash and Python for production automation.
Expertise in AWS services and scalable architecture design.
Hands-on experience with Terraform and Ansible.
Proficiency in pipeline design and release engineering.
Experience with Dynatrace, Prometheus, Grafana, ELK, or similar platforms.
Strong understanding of SLOs, SLIs, error budgets, and operational metrics like MTTR and MTTD.
Good to Have:
Proven ability to lead and improve incident management processes.
Ability to coordinate with third-party vendors for application support and issue resolution.
Knowledge of security best practices and compliance frameworks.
Strong leadership, mentoring, and cross-team collaboration abilities.
Preferred Qualifications
Experience with multi-account AWS architecture.
Familiarity with automation and self-healing systems.
Knowledge of performance modeling and long-term capacity forecasting.
Certifications in AWS, Terraform, or SRE practices.
Position Requirements
10+ years of work experience