Site Reliability Engineering Lead
Job in
Cape Town, 7100, South Africa
Listed on 2026-01-30
Listing for:
Lulalend
Full Time position
Job specializations:
- IT/Tech: IT Project Manager, Cloud Computing, Systems Engineer, SRE/Site Reliability
Job Description & How to Apply Below
We are seeking an experienced Site Reliability Engineering Lead to lead, mentor, and grow our SRE team. The ideal candidate will have a deep understanding of Microsoft Azure, cloud computing, and distributed systems.
As the SRE Lead, you will be responsible for the overall strategy and execution of our SRE function. You will guide your team to monitor, maintain, and improve our Azure-based infrastructure and applications, ensuring their reliability, scalability, and security.
KEY RESPONSIBILITIES:
- Lead, mentor, and develop a high-performing SRE team, fostering a culture of ownership, collaboration, and continuous improvement.
- Manage the team's performance, including setting clear goals, conducting regular 1:1s, and supporting career development.
- Collaborate with the software engineering manager on the recruitment process to grow the SRE team, ensuring a high bar for technical skill and cultural fit.
- Own and manage the 24/7 on‑call rotation and incident response process, acting as a key escalation point and driving effective root cause analysis (RCA) and remediation plans.
- Define and drive the SRE technical roadmap, partnering with engineers, DevOps, and SecOps to build and manage highly available, scalable, and resilient architectures on Azure.
- Oversee the platform's monitoring and alerting strategy, guiding the team to build a holistic view of infrastructure and application performance using tools like Azure Monitor.
- Champion automation by directing the team's development of scripts and tools to streamline deployment and management of Azure services.
- Drive platform optimisation by analysing performance metrics and evaluating new Azure features and services to improve workflows.
- Ensure the security of the Azure infrastructure by enforcing security policies and best practices in partnership with the SecOps team.
- Foster a culture of delivery, continuous improvement and innovation within the SRE team, encouraging experimentation.
REQUIREMENTS:
- Matric certificate or equivalent.
- 5+ years of experience in a senior SRE, DevOps, or Cloud Infrastructure role, with deep knowledge of maintaining Azure infrastructure.
- 2+ years of formal people management and leadership experience.
- Demonstrable experience leading incident response and root cause analysis.
- Strong understanding of Azure services such as Web Applications, Functions, and Application Gateways.
- Strong experience with automation tools such as PowerShell, Azure CLI, and ARM templates.
- Deep experience with monitoring and logging tools such as Azure Monitor, Log Analytics, Application Insights, Logic Apps, and Grafana or similar.
- Excellent troubleshooting, problem‑solving, and strategic planning skills.
- Strong familiarity with DevOps practices and tools such as Jira and Opsgenie.
- Monitoring & Observability: Azure Monitor, Log Analytics and Grafana.
- Operations & Incident Management: Jira, Sentinel and Opsgenie.