Site Reliability Engineering Lead Job, Cape Town area, South Africa, IT/Tech

We are seeking an experienced Site Reliability Engineering Lead to manage, mentor, and grow our SRE team. The ideal candidate will have a deep understanding of Microsoft Azure, cloud computing, and distributed systems.

As the SRE Lead, you will be responsible for the overall strategy and execution of our SRE function. You will guide your team to monitor, maintain, and improve our Azure-based infrastructure and applications, ensuring their reliability, scalability, and security.

KEY RESPONSIBILITIES:

Lead, mentor, and develop a high-performing SRE team, fostering a culture of ownership, collaboration, and continuous improvement.
Manage the team's performance, including setting clear goals, conducting regular 1:1s, and supporting career development.
Collaborate with the software engineering manager on the recruitment process to grow the SRE team, ensuring a high bar for technical skill and cultural fit.
Own and manage the 24/7 on‑call rotation and incident response process, acting as a key escalation point and driving effective root cause analysis (RCA) and remediation plans.
Define and drive the SRE technical roadmap, partnering with engineers, DevOps, and SecOps to build and manage highly available, scalable, and resilient architectures on Azure.
Oversee the platform's monitoring and alerting strategy, guiding the team to build a holistic view of infrastructure and application performance using tools like Azure Monitor.
Champion automation by directing the team's development of scripts and tools to streamline deployment and management of Azure services.
Drive platform optimisation by analysing performance metrics and evaluating new Azure features and services to improve workflows.
Ensure the security of the Azure infrastructure by enforcing security policies and best practices in partnership with the SecOps team.
Foster a culture of delivery, continuous improvement and innovation within the SRE team, encouraging experimentation.

THE EXPERIENCE WE’RE LOOKING FOR:

Matric certificate or equivalent.
5+ years of experience in a senior SRE, DevOps, or Cloud Infrastructure role, with deep knowledge of maintaining Azure infrastructure.
At least 2 years of formal people management and leadership experience.
Demonstrable experience leading incident response and root cause analysis.
Strong understanding of Azure services such as Web Apps, Functions, and Application Gateway.
Strong experience with automation tools such as PowerShell, Azure CLI, and ARM templates.
Deep experience with monitoring and logging tools such as Azure Monitor, Log Analytics, Application Insights, Logic Apps, and Grafana or similar.
Excellent troubleshooting, problem‑solving, and strategic planning skills.
Strong familiarity with DevOps practices and tools such as Jira and Opsgenie.
Monitoring & Observability:
Azure Monitor, Log Analytics and Grafana.
Operations & Incident Management:
Jira, Sentinel and Opsgenie.
