Reliability Engineer
Listed on 2025-12-17
IT/Tech
Systems Engineer, Cloud Computing
SENIOR SITE RELIABILITY ENGINEER (SRE) – PORTSMOUTH, HANTS, UK
Are you a Reliability Engineer who loves problem‑solving, analysing complex data and improving the long-term performance of critical assets? If you're excited by the idea of using real‑time asset health insights to reduce downtime and optimise maintenance strategies, this role could be your perfect next step.
We’re working with a major organisation responsible for a large, highly regulated UK infrastructure. The role is based in Portsmouth, Hampshire, UK, but offers significant remote flexibility. It is crucial to ensuring the availability, performance, scalability and reliability of our production systems and services. The successful candidate will be responsible for automating operational tasks, developing monitoring and alerting systems, responding to incidents and driving improvements in system stability and efficiency.
You will work on building and maintaining robust infrastructure, implementing Infrastructure as Code (IaC) using tools such as Terraform or Ansible, and managing CI/CD pipelines. A key aspect of the role involves collaborating with development teams to ensure services are designed for reliability and operability from the outset. You will be involved in capacity planning, performance tuning and disaster recovery strategies.
Experience with container orchestration platforms such as Kubernetes is highly desirable. The role requires a proactive approach to identifying potential issues, a strong understanding of networking and the ability to troubleshoot complex system problems under pressure. You will play a key role in fostering a culture of reliability and operational excellence within the engineering organisation.
Key Responsibilities
- Ensure the high availability, performance, and scalability of production systems.
- Automate infrastructure provisioning, configuration, and deployment using IaC tools.
- Develop and maintain robust monitoring, alerting and logging systems.
- Respond to and resolve production incidents, performing root cause analysis.
- Collaborate with development teams to improve service reliability and operability.
- Implement and manage CI/CD pipelines for efficient software delivery.
- Conduct capacity planning and performance tuning.
- Develop and test disaster recovery and business continuity plans.
- Manage and optimise containerised environments (e.g., Kubernetes).
- Contribute to architectural decisions related to system design and reliability.
Requirements
- Bachelor's or Master's degree in Computer Science, Engineering or a related field, or equivalent experience.
- Minimum of 5 years of experience in Site Reliability Engineering, DevOps or Systems Engineering.
- Strong experience with cloud platforms (AWS, Azure or GCP).
- Proficiency in at least one scripting or programming language (e.g. Python, Bash, Go).
- Experience with Infrastructure as Code tools (e.g. Terraform, Ansible, Chef, Puppet).
- Solid understanding of Linux/Unix operating systems.
- Experience with containerisation and orchestration technologies (Docker, Kubernetes).
- Knowledge of networking concepts and protocols.
- Experience with monitoring and logging tools (e.g. Prometheus, Grafana, ELK stack).
- Excellent problem‑solving and troubleshooting skills.
Purpose of the Role
The post holder will lead and oversee the NATS Risk Management Framework, including the policy, processes, risk register (Riskonnect), reporting portals and associated documents and tools. The role is responsible for ensuring the effective identification, evaluation, management and reporting of risk across the organisation.
The post holder will be accountable for implementing the Risk Management framework and delivering Risk Management Capability, Assurance and Risk Reporting. Key stakeholders include the Chairman of the Audit Committee, Audit Committee members, the NATS Legal Director, Executive Directors and their direct reports. The post holder will collaborate with the NATS Executive and Senior Leadership Team to integrate risk considerations into strategic planning and business objectives.
Key Accountabilities
- Own the NATS Risk Management Policy, Process, Risk…