Site Reliability Engineer Job, Glasgow area, Scotland, UK, IT/Tech

Site Reliability Engineer (SRE)

Are you passionate about building resilient, scalable systems and driving operational excellence? Orion Health is looking for an experienced and proactive Site Reliability Engineer (SRE) to join our Technology team. In this role, you will be responsible for ensuring the reliability, availability, performance, and scalability of our cloud infrastructure and healthcare platforms that support millions of users worldwide.

As a Site Reliability Engineer, you will work at the intersection of software engineering and operations, applying automation, observability, and reliability engineering practices to improve platform stability, reduce operational toil, and enable development teams to deliver high-quality solutions with confidence.

What You'll Be Doing

Design, implement, and maintain reliable, scalable, and secure infrastructure that supports Orion Health's products and services.
Define and monitor Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs) to ensure platform reliability and customer satisfaction.
Build and maintain observability solutions, including monitoring, logging, alerting, and tracing capabilities across cloud environments.
Participate in incident response activities, including troubleshooting, root cause analysis, remediation planning, and post-incident reviews.
Lead initiatives to reduce operational toil through automation, Infrastructure as Code (IaC), and self-service capabilities.
Collaborate closely with software engineering teams to improve application reliability, performance, and operational readiness.
Identify and eliminate reliability bottlenecks through performance tuning, capacity planning, and system optimisation.
Support infrastructure and platform upgrades, ensuring minimal disruption and maintaining service availability.
Conduct capacity forecasting and scalability planning to meet future business and customer demands.
Develop operational runbooks, standards, and best practices that improve system resilience and operational efficiency.
Champion reliability engineering principles and foster a culture of continuous improvement across teams.
Contribute to disaster recovery, business continuity, and platform resilience initiatives.

What You'll Bring To The Role

A passion for reliability engineering, automation, and scalable cloud technologies.
Strong analytical and problem-solving skills with a focus on operational excellence.
A proactive approach to identifying risks and preventing incidents before they impact customers.
Excellent communication skills and the ability to collaborate effectively with engineering, product, and operational teams.
The ability to balance reliability, performance, security, and delivery priorities in a fast-paced environment.
A continuous improvement mindset and commitment to learning emerging technologies and industry best practices.

Experience

3+ years of experience in Site Reliability Engineering, Platform Engineering, DevOps, Cloud Operations, or Infrastructure Engineering roles.
Experience supporting and operating production cloud environments.
Strong experience with cloud platforms such as AWS, Azure, or Google Cloud Platform.
Experience implementing Infrastructure as Code (IaC) using tools such as Terraform, Bicep, ARM templates, or CloudFormation.
Experience with containerisation and orchestration technologies such as Docker and Kubernetes.
Experience building and maintaining monitoring, logging, and observability solutions.
Experience managing production incidents and conducting root cause analysis.
Knowledge of CI/CD pipelines and modern software delivery practices.
Experience with automation and scripting using tools such as PowerShell, Bash, Python, or similar.
Understanding of networking, security, high availability, and disaster recovery principles.
Experience supporting highly available, customer-facing applications and services.

Skills

Site Reliability Engineering (SRE) practices and principles.
Cloud infrastructure administration and optimisation.
Infrastructure as Code (IaC).
Monitoring, observability, and alerting.
Incident management and post-incident analysis.
Capacity planning…