Site Reliability Engineering Manager
Listed on 2026-02-10
IT/Tech
Cloud Computing, SRE/Site Reliability
Overview
At MHR, our employees are central to our success and play a key role in helping customers achieve sustainable high performance. With a team of over 900 professionals, we work to make things flow smoothly, whether it's for large organisations or individual employees. As businesses face rapid changes in the world of work, our team is here to help them adapt and thrive.
By focusing on the core needs of efficiency, productivity, growth, and impact, our employees use their expertise to deliver real solutions through our People and Finance platform. This system, which covers finance, HR, payroll, and learning, helps businesses run more smoothly and make better decisions in real time.
With over 40 years of experience behind us, MHR’s track record as a high-performance organisation is built on clear goals, a shared vision, and strong communication - all of which we pass on to our customers.
MHR is more than just a place to work; it’s a platform for empowerment. Joining us means bringing together innovation, technology, and teamwork in a way that removes obstacles, enhances your skills, and allows you to focus on what’s most important to you: work that matters.
With us, you’ll grow, find your flow, and make a lasting difference in your career, your team, and your impact.
YOUR CAREER
Are you ready to be at the forefront of cutting-edge technology, shaping the future of tech? Join our dynamic team and unleash your potential in an environment that fosters creativity, collaboration and continuous growth. We’re looking for a Site Reliability Engineering Manager to join our growing Cloud Operations team at MHR.
YOUR TEAM
Leading the Cloud Operations team, you will play a vital role in supporting the People First SaaS platform, a modern, microservices-based HR and payroll solution built in Azure and delivered to hundreds of customers.
YOUR IMPACT
You will lead and develop a high-performing Cloud Operations team responsible for the reliability, scalability and automation of MHR’s People First platform in Azure.
The role combines technical excellence, leadership and strategic direction, ensuring the platform operates with high resilience, observability and efficiency aligned to Site Reliability Engineering (SRE) principles.
- Lead, mentor and develop a high-performing Cloud Operations team, setting operational standards and driving a culture of reliability and continuous improvement.
- Own monitoring, alerting and observability frameworks, using SLIs/SLOs and operational data to guide service reliability, reduce MTTR and improve platform health.
- Drive automation across environment builds, deployments, configuration management and operational tooling to deliver consistent, scalable and efficient operations.
- Collaborate with Platform, Development and Architecture teams to ensure solutions are designed for operability, resilience, capacity readiness and disaster recovery.
- Lead incident response and root cause analysis, identifying preventive measures while contributing to governance and cost optimisation and participating in the on-call rota.
THE ROLE AND MHR
- Experience leading and mentoring engineers, fostering a culture of reliability, performance and operational excellence.
- Understanding of how Java/.NET service architectures influence deployments, change management and rollback safety, supporting reliable and controlled releases.
- Experience implementing logging, metrics and tracing for backend services, using tools such as Dynatrace, Azure Monitor, Application Insights and Grafana to inform SLIs/SLOs, reduce MTTR and improve reliability.
- Experience applying IaC to Java/.NET workloads using tools such as Terraform and Bicep to deliver repeatable, consistent and auditable environments.
- Experience operating cloud‑hosted SaaS platforms in Microsoft Azure, with a focus on resilience, autoscaling, fault tolerance and operational readiness for backend services.
- Ability to automate workflows across build, deploy, configuration, drift correction and resilience routines using PowerShell, Terraform or similar tooling.
- Experience leading incident response, performing root‑cause analysis, running post‑incident reviews and…