Site Reliability Engineer
Listed on 2025-11-03
IT/Tech
Cloud Computing, Systems Engineer, SRE/Site Reliability, IT Support
We are seeking a highly skilled Site Reliability Engineer (SRE) with deep expertise in Kubernetes and cloud technologies (AWS, Azure, or GCP).
The SRE will be responsible for designing, deploying, automating, and supporting highly available, scalable, and secure containerized applications in cloud-native environments. You will work closely with development, operations, and security teams to ensure the reliability, performance, and efficiency of our production systems.
Key Responsibilities
- Design, deploy, and manage Kubernetes clusters (on‑premises and/or cloud‑managed, such as EKS, AKS, or GKE) to support scalable microservices architectures.
- Automate infrastructure provisioning and application deployment using Infrastructure as Code (IaC) tools such as Terraform, Helm, or CloudFormation.
- Monitor, troubleshoot, and optimize system performance using observability tools.
- Implement and manage CI/CD pipelines to ensure rapid, repeatable, and reliable software delivery.
- Ensure system reliability, availability, and security through proactive monitoring, incident response, and root cause analysis.
- Develop and maintain runbooks, dashboards, and documentation for operational procedures and system architectures.
- Participate in on‑call rotations and respond to production incidents, ensuring minimal downtime and fast recovery.
- Collaborate with development and operations teams to drive DevOps and SRE best practices, including capacity planning, scaling, and cost optimization.
- Continuously improve automation tooling and processes to reduce manual work and increase system reliability.
Qualifications
- 3 years of experience as an SRE, DevOps Engineer, or in a similar role supporting large‑scale systems.
- Expertise in Kubernetes deployment, scaling, upgrades, troubleshooting, and networking.
- Hands‑on experience with at least one major cloud provider (AWS, Azure, or GCP).
- Proficiency in scripting/programming (Python, Bash, Go, etc.).
- Experience with IaC tools (Terraform, Helm, CloudFormation, ARM, etc.).
- Strong knowledge of Linux systems administration and networking concepts.
- Familiarity with monitoring, logging, and ingestion tools (Prometheus, Grafana, ELK, EFK).
- Experience with CI/CD tools (Jenkins, GitLab CI, ArgoCD, etc.).
- Understanding of security best practices in cloud and containerized environments.
- Excellent troubleshooting and problem‑solving skills.
- Strong communication and collaboration skills.
Preferred Qualifications
- Certified Kubernetes Administrator (CKA) or similar certification.
- Experience with service mesh (Istio, Linkerd), ingress controllers, and API gateways.
- Experience in a multicloud or hybrid cloud environment.
- Familiarity with GitOps practices and tools (ArgoCD, Flux).
- Experience with disaster recovery, backup, and business continuity planning.
Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent experience.
This role is ideal for engineers who are passionate about automation, reliability, and modern cloud‑native architectures and who thrive in fast‑paced, collaborative environments.
We are an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, protected veteran status, or disability status.