Job Description & How to Apply Below
We are seeking a highly skilled Site Reliability Engineer (SRE) with strong experience in Kubernetes troubleshooting, incident response, and deep knowledge of monitoring and alerting systems, along with solid experience in CI/CD pipeline design and maintenance. You will play a key role in building and maintaining reliable infrastructure, enhancing observability, and ensuring uptime for mission-critical systems.
In this role, you will...
Diagnose and resolve issues in Kubernetes clusters, including deployments, pod failures, networking issues, and autoscaling.
Lead incident management efforts including on-call response, root cause analysis, and continuous improvement of incident playbooks.
Design and maintain monitoring, logging, and alerting systems using tools such as Prometheus, Grafana, and ELK (Elasticsearch, Logstash, Kibana).
Set up and manage Kibana dashboards and maintain the ELK stack to ensure high availability and performance of logging infrastructure.
Integrate metrics, logs, and traces into a unified observability platform.
Build and maintain alerting pipelines that cut noise and improve the signal-to-noise ratio for production incidents.
Contribute to infrastructure automation using tools such as Terraform and Helm.
Set up and support CI/CD pipelines for automated testing, deployment, and rollback across multiple environments.
Participate in shift rotations and continuously improve observability and response systems.
You've Got What It Takes If You Have...
2+ years in an SRE, DevOps, or Infrastructure Engineer role.
Bachelor's degree in computer science, IT, or related technical field.
Hands-on experience with AWS and GCP.
Deep hands-on experience with Kubernetes (EKS, AKS, GKE).
Strong understanding of Linux internals, container orchestration, and microservice architecture.
Hands-on experience with monitoring/logging tools:
Prometheus, Grafana, InfluxDB
ELK stack (Elasticsearch, Logstash, Kibana)
Proficient in incident response and alerting tools (PagerDuty, etc.).
Basic knowledge of:
Kafka - topic monitoring, consumer health
ElastiCache / Redis - caching patterns and troubleshooting
InfluxDB - time-series metrics storage
Experience writing and maintaining automation scripts in Bash, Python, or Go.