
Site Reliability Engineer

Job in Greater London, London, Greater London, W1B, England, UK
Listing for: Albatross
Full Time position
Listed on 2026-02-04
Job specializations:
  • IT/Tech
    Systems Engineer, Cloud Computing, SRE/Site Reliability, IT Support

Location: Remote, right to work and travel in Europe.

At Albatross, we’re building the second pillar of AI: a perception layer that understands how users actually experience content, in real time. Trained on live user interactions, Albatross learns and reasons on the fly. Our technology powers in-session discovery by adapting to evolving user interests as they change. We have raised significant funding and our platform already operates at scale, with billions of events processed and hundreds of millions of predictions served.

The Role

We’re looking for a Site Reliability Engineer to own the reliability and observability of our platform. This is a hands-on leadership role where you’ll design, build, and maintain our observability stack, lead incident response, oversee releases, and establish the processes and standards that allow the team to ship quickly and confidently. More specifically, you will:

  • Observability & Monitoring:
    Own and evolve our observability stack (Prometheus, Grafana, Loki, Jaeger), including dashboards, alerts, and SLOs. Instrument services for meaningful metrics and tracing, reducing noise and improving signal.
  • Reliability & Incident Response:
    Lead incident response and establish blameless postmortems, runbooks, and automated remediation. Define, track, and improve SLIs/SLOs to proactively reduce reliability risk.
  • Release Management:
    Own the release process end-to-end, improving deployment speed, safety, and recovery. Implement progressive rollouts, feature flags, and rollback strategies.
  • Platform & Tooling:
    Embed observability into the development lifecycle in close collaboration with engineering. Maintain and evolve our Kubernetes-based platform, adopting new tools when they add real value.
Requirements
  • 5–7+ years in SRE, platform engineering, DevOps, or similar roles.
  • Strong production experience with Kubernetes and modern observability stacks (Prometheus, Grafana, Loki, Jaeger/OpenTelemetry).
  • Proven track record of leading incident response and building monitoring systems that teams actually use.
  • Deep distributed systems knowledge and production debugging experience.
  • Pragmatic approach to tooling and alerting that teams trust.
  • Clear communicator across engineering, product, and leadership.
  • STEM degree (Computer Science, Engineering, Mathematics, or similar).
  • Plus: contributions to open-source observability projects and background in high-scale or high-availability environments.
Benefits
  • Remote-first, async-friendly culture.
  • Ownership and autonomy: you’ll shape how we do reliability.
  • A team that cares about building things right.