Senior Site Reliability Engineer/Senior DevOps
Listed on 2026-06-25
IT/Tech
SRE/Site Reliability, Cloud Computing: Infrastructure & Operations, Systems Engineer
We are seeking an experienced Senior Site Reliability Engineer (SRE) to support and optimize mission-critical cloud and on-premises platforms across Microsoft Azure and highly secure air-gapped Kubernetes environments.
The ideal candidate will be responsible for ensuring the reliability, scalability, availability, security, and operational excellence of enterprise platforms running across Azure Kubernetes Service (AKS) and Rancher RKE2 environments. This role requires deep expertise in Kubernetes operations, infrastructure automation, GitOps practices, observability, incident management, and cloud-native platform engineering.
You will collaborate closely with engineering, infrastructure, security, and operations teams to build resilient, highly available, and secure enterprise platforms while driving automation and continuous operational improvement.
Key Responsibilities
- Ensure reliability, availability, scalability, and performance of services running across Azure AKS and air-gapped Kubernetes environments.
- Manage and optimize Kubernetes platforms, including ingress controllers, storage layers, networking, and stateful workloads.
- Design and implement infrastructure automation using Infrastructure-as-Code (IaC) and configuration management tools.
- Implement and maintain GitOps deployment practices using ArgoCD and Kustomize.
- Build, manage, and optimize CI/CD pipelines using Azure DevOps and GitHub Actions.
- Support container lifecycle management, image repositories, registry mirroring, and container security practices.
- Monitor infrastructure and application health using enterprise observability platforms.
- Troubleshoot complex production issues and drive root cause analysis initiatives.
- Lead incident response activities and improve operational resilience through preventive measures.
- Collaborate with platform, application, cloud, and security teams to enhance system reliability and architecture.
- Participate in on-call support for critical production environments.
- Support ITIL-aligned incident, change, and problem management processes.
- Contribute to enterprise governance, security, compliance, and operational best practices.
Qualifications
- Bachelor's Degree in Computer Science, Engineering, Information Technology, or a related field.
- 10+ years of experience in Site Reliability Engineering, DevOps, Platform Engineering, or Cloud Infrastructure roles.
- Strong hands-on experience with Microsoft Azure cloud platforms.
- Extensive experience managing Kubernetes environments, including AKS and Rancher RKE2.
- Strong automation and scripting capabilities using Python, Go, and Bash.
- Proven experience implementing Infrastructure-as-Code using Terraform, Bicep, and Ansible.
- Strong knowledge of GitOps methodologies and modern CI/CD practices.
- Experience with enterprise monitoring, observability, and incident management.
- Excellent communication and stakeholder management skills.
- Experience working in air-gapped or highly regulated enterprise environments.
- Knowledge of container security and private registry management.
- Exposure to large-scale cloud transformation and modernization programs.
- Experience operating within Agile, Scrum, and ITIL environments.
- Familiarity with enterprise governance, risk, compliance, and audit requirements.