Site Reliability Engineer

Job in Penarth, Vale of Glamorgan, CF64, Wales, UK
Listing for: ELLIOTT MOSS CONSULTING PTE. LTD.
Per diem position
Listed on 2026-01-24
Job specializations:
  • IT/Tech
    SRE/Site Reliability, Systems Engineer, Cloud Computing, IT Support

Job Description

We are looking for a highly skilled Site Reliability Engineer (SRE) to own and evolve our enterprise observability and reliability platforms.

This role is responsible for ensuring availability, performance, scalability, and reliability of large-scale, cloud-native applications running on Kubernetes and OpenShift.

The SRE will partner closely with application and platform teams to embed reliability engineering, SLO-driven operations, and automation-first practices.

Key Responsibilities
Reliability Engineering & SRE Practices:
  • Define, implement, and continuously improve SLIs, SLOs, and error budgets for enterprise applications.
  • Drive reliability-focused decision-making using error budgets, MTTD, MTTR, and service health metrics.
  • Proactively identify reliability risks and performance bottlenecks and drive remediation.
  • Lead incident response, post-incident reviews (blameless postmortems), and reliability improvements.
Observability Platform Ownership:
  • Own and operate open-source–based observability platforms covering metrics, logging, and distributed tracing.
  • Enhance, optimize, and migrate observability solutions to improve scalability, resilience, and cost efficiency.
  • Maintain and tune Prometheus and other TSDBs, including cardinality management and resource optimization.
  • Operate distributed tracing platforms such as OpenTelemetry, Jaeger, and Zipkin, including tuning sampling strategies and troubleshooting microservices traces.
Kubernetes & OpenShift Reliability:
  • Support and enable application teams to migrate workloads to newer OpenShift/Kubernetes versions.
  • Deploy, manage, and troubleshoot stateful and stateless workloads on Kubernetes platforms.
  • Improve platform reliability through automation, self-healing, and standardized deployment patterns.
  • Partner with developers to implement application instrumentation and reliability best practices.
Logging, Alerting & Incident Response:
  • Operate enterprise logging platforms such as ELK Stack and Grafana Loki, including Elasticsearch cluster management and index lifecycle management.
  • Design and maintain actionable alerting aligned to SLOs and business impact.
  • Integrate alerting platforms with PagerDuty, Microsoft Teams, and other incident management tools.
  • Reduce alert fatigue by implementing alert hygiene and signal-to-noise optimization.
Dashboards & Service Visibility:
  • Deploy and administer visualization tools such as Grafana and Kibana.
  • Create standardized, reusable dashboards for service health, reliability, and capacity planning.
  • Implement and manage RBAC across observability platforms.
Infrastructure, Security & Automation:
  • Troubleshoot observability infrastructure issues across Linux VMs and Kubernetes pods.
  • Secure observability and platform endpoints using TLS, reverse proxies, and authentication mechanisms (MFA, LDAPS, OAuth).
  • Build and maintain CI/CD pipelines for observability and reliability tooling.
  • Extend pipelines to support multiple environments and regions with consistency and repeatability.
Reliability Culture & Enablement:
  • Champion an SRE and observability-first culture across engineering teams.
  • Coach teams on golden signals, service health modeling, and reliability trade-offs.
  • Enable teams to move from reactive monitoring to proactive reliability engineering.
Required Skills & Experience
Core Technical Skills:
  • Strong hands-on experience with:
    Prometheus, Grafana;
    Elasticsearch, Kibana (cluster operations, ILM, tuning);
    OpenTelemetry, Jaeger, Zipkin;
    Kubernetes & OpenShift;
    Linux OS troubleshooting;
    CI/CD pipelines and automation.
  • Solid understanding of SRE principles, including SLIs, SLOs, error budgets, and incident management.
  • Experience supporting production, highly available, distributed systems.
Working Hours:
  • Monday to Friday, 9:00 AM – 6:00 PM. Occasional weekend support may be required for critical deployments or incidents; compensatory time off will be provided.