Site Reliability Engineer Job, Glasgow area, Scotland, UK, IT/Tech

Are you passionate about building resilient, scalable systems and driving operational excellence? Orion Health is looking for an experienced and proactive Site Reliability Engineer (SRE) to join our Technology team. In this role, you will be responsible for ensuring the reliability, availability, performance, and scalability of our cloud infrastructure and healthcare platforms that support millions of users worldwide.

As a Site Reliability Engineer, you will work at the intersection of software engineering and operations, applying automation, observability, and reliability engineering practices to improve platform stability, reduce operational toil, and enable development teams to deliver high-quality solutions with confidence.

What You'll Be Doing

As a Site Reliability Engineer, you will play a critical role in maintaining and evolving Orion Health's cloud infrastructure and operational platforms. You will help define and implement reliability standards, improve system observability, automate operational processes, and lead efforts to enhance platform resilience.

Design, implement, and maintain reliable, scalable, and secure infrastructure that supports Orion Health's products and services.
Define and monitor Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs) to ensure platform reliability and customer satisfaction.
Build and maintain observability solutions, including monitoring, logging, alerting, and tracing capabilities across cloud environments.
Participate in incident response activities, including troubleshooting, root cause analysis, remediation planning, and post-incident reviews.
Lead initiatives to reduce operational toil through automation, Infrastructure as Code (IaC), and self-service capabilities.
Collaborate closely with software engineering teams to improve application reliability, performance, and operational readiness.
Identify and eliminate reliability bottlenecks through performance tuning, capacity planning, and system optimisation.
Support infrastructure and platform upgrades, ensuring minimal disruption and maintaining service availability.
Conduct capacity forecasting and scalability planning to meet future business and customer demands.
Develop operational runbooks, standards, and best practices that improve system resilience and operational efficiency.
Champion reliability engineering principles and foster a culture of continuous improvement across teams.
Contribute to disaster recovery, business continuity, and platform resilience initiatives.

What You'll Bring to the Role

A passion for reliability engineering, automation, and scalable cloud technologies.
Strong analytical and problem-solving skills with a focus on operational excellence.
A proactive approach to identifying risks and preventing incidents before they impact customers.
Excellent communication skills and the ability to collaborate effectively with engineering, product, and operational teams.
The ability to balance reliability, performance, security, and delivery priorities in a fast-paced environment.
A continuous improvement mindset and commitment to learning emerging technologies and industry best practices.

Experience

To succeed in this role, you will ideally have:

3+ years of experience in Site Reliability Engineering, Platform Engineering, DevOps, Cloud Operations, or Infrastructure Engineering roles.
Experience supporting and operating production cloud environments.
Strong experience with cloud platforms such as AWS, Azure, or Google Cloud Platform.
Experience implementing Infrastructure as Code (IaC) using tools such as Terraform, Bicep, ARM, or CloudFormation.
Experience with containerisation and orchestration technologies such as Docker and Kubernetes.
Experience building and maintaining monitoring, logging, and observability solutions.
Experience managing production incidents and conducting root cause analysis.
Knowledge of CI/CD pipelines and modern software delivery practices.
Experience with automation and scripting using languages such as PowerShell, Bash, Python, or similar.
Understanding of networking, security, high availability, and disaster recovery principles.
Ex…