Site Reliability Engineer Job, Penarth area, Wales, UK, IT/Tech

Job Description

We are looking for a highly skilled Site Reliability Engineer (SRE) to own and evolve our enterprise observability and reliability platforms.

This role is responsible for ensuring the availability, performance, scalability, and reliability of large-scale, cloud-native applications running on Kubernetes and OpenShift.

The SRE will partner closely with application and platform teams to embed reliability engineering, SLO-driven operations, and automation-first practices.

Key Responsibilities

Reliability Engineering & SRE Practices:
Define, implement, and continuously improve SLIs, SLOs, and error budgets for enterprise applications.
Drive reliability-focused decision-making using error budgets, MTTD, MTTR, and service health metrics.
Proactively identify reliability risks and performance bottlenecks and drive remediation.
Lead incident response, post-incident reviews (blameless postmortems), and reliability improvements.
Observability Platform Ownership:
Own and operate open-source–based observability platforms covering metrics, logging, and distributed tracing.
Enhance, optimize, and migrate observability solutions to improve scalability, resilience, and cost efficiency.
Maintain and tune Prometheus and other TSDBs, including cardinality management and resource optimization.
Operate distributed tracing platforms such as OpenTelemetry, Jaeger, and Zipkin, including tuning sampling strategies and troubleshooting microservices traces.
Kubernetes & OpenShift Reliability:
Support and enable application teams to migrate workloads to newer OpenShift/Kubernetes versions.
Deploy, manage, and troubleshoot stateful and stateless workloads on Kubernetes platforms.
Improve platform reliability through automation, self-healing, and standardized deployment patterns.
Partner with developers to implement application instrumentation and reliability best practices.
Logging, Alerting & Incident Response:
Operate enterprise logging platforms such as ELK Stack and Grafana Loki, including Elasticsearch cluster management and index lifecycle management.
Design and maintain actionable alerting aligned to SLOs and business impact.
Integrate alerting platforms with PagerDuty, Microsoft Teams, and other incident management tools.
Reduce alert fatigue by implementing alert hygiene and signal-to-noise optimization.
Dashboards & Service Visibility:
Deploy and administer visualization tools such as Grafana and Kibana.
Create standardized, reusable dashboards for service health, reliability, and capacity planning.
Implement and manage RBAC across observability platforms.
Infrastructure, Security & Automation:
Troubleshoot observability infrastructure issues across Linux VMs and Kubernetes pods.
Secure observability and platform endpoints using TLS, reverse proxies, and authentication mechanisms (MFA, LDAPS, OAuth).
Build and maintain CI/CD pipelines for observability and reliability tooling.
Extend pipelines to support multiple environments and regions with consistency and repeatability.
Reliability Culture & Enablement:
Champion an SRE and observability-first culture across engineering teams.
Coach teams on golden signals, service health modeling, and reliability trade-offs.
Enable teams to move from reactive monitoring to proactive reliability engineering.

Required Skills & Experience

Core Technical Skills:
Strong hands-on experience with:
Prometheus, Grafana;
Elasticsearch, Kibana (cluster operations, ILM, tuning);
OpenTelemetry, Jaeger, Zipkin;
Kubernetes & OpenShift;
Linux OS troubleshooting;
CI/CD pipelines and automation.
Solid understanding of SRE principles, including SLIs, SLOs, error budgets, and incident management.
Experience supporting production, highly available, distributed systems.
Working Hours:

Monday to Friday, 9:00 AM – 6:00 PM. Occasional weekend support may be required for critical deployments or incidents; compensatory time off will be provided.
