Head of Delivery Job Penn Yan area,Town of Italy New York USA,IT/Tech

Location: Town of Italy

Overview

Location:
Remote, right to work and travel in Europe.

Albatross:
At Albatross, we're building the second pillar of AI: a perception layer that understands how users actually experience content, in real time. Trained on live user interactions, Albatross learns and reasons on the fly. Our technology powers in-session discovery by adapting to evolving user interests in real time. We have raised significant funding, and our platform already operates at scale, processing billions of events and serving hundreds of millions of predictions.

The Role

We're looking for a Site Reliability Engineer to own the reliability and observability of our platform. This is a hands-on leadership role where you'll design, build, and maintain our observability stack, lead incident response, oversee releases, and establish the processes and standards that allow the team to ship quickly and confidently. More specifically, you will:

Observability & Monitoring:
Own and evolve our observability stack (Prometheus, Grafana, Loki, Jaeger), including dashboards, alerts, and SLOs. Instrument services for meaningful metrics and tracing, reducing noise and improving signal.
Reliability & Incident Response:
Lead incident response and establish blameless postmortems, runbooks, and automated remediation. Define, track, and improve SLIs/SLOs to proactively reduce reliability risk.
Release Management:
Own the release process end-to-end, improving deployment speed, safety, and recovery. Implement progressive rollouts, feature flags, and rollback strategies.
Platform & Tooling:
Embed observability into the development lifecycle in close collaboration with engineering. Maintain and evolve our Kubernetes-based platform, adopting new tools when they add real value.

Requirements

5-7+ years in SRE, platform engineering, DevOps, or similar roles
Strong production experience with Kubernetes and modern observability stacks (Prometheus, Grafana, Loki, Jaeger/OpenTelemetry)
Proven track record leading incident response and building monitoring systems teams actually use
Deep distributed systems knowledge and production debugging experience
Pragmatic approach to tooling and alerting that teams trust
Clear communicator across engineering, product, and leadership
STEM degree (Computer Science, Engineering, Mathematics, or similar)
Plus: contributions to open-source observability projects and background in high-scale or high-availability environments

Benefits

Remote-first, async-friendly culture
Ownership and autonomy: you'll shape how we do reliability
A team that cares about building things right
