Job Description & How to Apply Below
Summary
We are looking for a Senior Site Reliability Engineer (SRE) to build and operate scalable, reliable, and secure platform infrastructure. The ideal candidate will drive automation, observability, incident management, and cloud-native best practices to improve system reliability and operational excellence across distributed systems.
Roles & Responsibilities
Define and manage SLIs, SLOs, and error budgets for critical services
Design and enhance monitoring, logging, alerting, and tracing capabilities
Automate operational processes and improve platform efficiency
Participate in incident response, root cause analysis (RCA), and postmortem reviews
Support production environments through on-call rotations and reliability initiatives
Improve system performance, scalability, availability, and capacity planning
Collaborate with engineering teams to enhance application resiliency and operational readiness
Drive adoption of Infrastructure as Code (IaC) and CI/CD best practices
Maintain highly available, fault-tolerant, and secure cloud infrastructure
Skills
Strong Linux/Unix administration and debugging skills
Proficiency in Python/Bash/Shell scripting and automation
Expertise in observability and monitoring tools such as Grafana, Prometheus, ELK, and New Relic
Strong expertise in AWS and cloud infrastructure management
Strong experience with log analysis and monitoring using ELK
Strong incident management, communication, and operational excellence mindset
Hands-on experience with Kubernetes, Docker, and container orchestration
Experience with Terraform and Infrastructure as Code practices
Strong understanding of networking, DNS, load balancing, and distributed systems
Experience with CI/CD tools such as Jenkins, GitHub Actions, GitLab CI, or ArgoCD
Qualifications
B.Tech/B.E. or equivalent
4+ years of experience in SRE, DevOps, Platform Engineering, or Systems Engineering
Good to Have
Bachelor's degree in Computer Science, Engineering, or a related field
Cloud or Kubernetes certifications
Experience managing production incidents in high-availability environments
Exposure to multi-cloud architectures (AWS/GCP/Azure)
Position Requirements
10+ years of work experience