Senior Site Reliability Engineer at Talon.One, Berlin, Germany (IT / Information Technology)

Talon.One is the most powerful incentives engine that unifies loyalty, promotions and gamification into one holistic platform. Backed by enterprise-grade security and scalability, Talon.One empowers companies to build personalized, profitable promotions and loyalty programs using any data.

Today, over 250 of the world’s most-loved brands including Adidas, Sephora and Carlsberg work with Talon.One to drive deeper engagement and lasting loyalty with their customers.

ABOUT THE TEAM

As our Senior Site Reliability Engineer, you will own and drive reliability across the Talon.One platform. This is a hands‑on senior role with broad impact. You will shape how we design, measure, and improve reliability across the entire engineering organization.

You will build and evolve our reliability foundations, from observability architecture and SLO frameworks to incident management and production standards. You will not only respond to incidents, but systematically eliminate their root causes. You will reduce operational toil through automation, improve signal quality across our monitoring systems, and guide engineering teams in building resilient, scalable services by design.

If you enjoy building practical systems, setting technical direction, and delivering measurable reliability improvements across a complex distributed platform, this role is for you.

ONCE YOU ARE HERE YOU WILL

Own reliability outcomes: availability, latency, error rates, and overall operational health.
Define and introduce SLOs and error budgets to establish clear reliability targets and drive engineering prioritization.
Guide the engineering organization with designs, standards, and best practices to ensure reliability and stability across the Talon.One product.
Build and evolve observability across metrics, logs, and traces, making the system understandable, not just monitored.
Design and improve our monitoring/observability architecture end‑to‑end, including data pipelines, signal quality, alert strategy, dashboards, SLO implementation, and cost‑aware scalability.
Eliminate operational toil by building reliability tooling and automation that reduces repetitive work and improves system resilience.
Drive structural improvements by identifying and addressing the underlying causes of incidents, not just managing their symptoms.
Lead and continuously improve incident management: on‑call readiness, severity handling, stakeholder communication, blameless post‑mortems, and strong follow‑through.
Drive continuous improvement: reduce noisy alerts, close reliability gaps, and automate recurring operational work.
Work deeply in Kubernetes and cloud environments, especially Google Cloud, and make deployments safer and more predictable.
Operate with GitOps principles: reliability changes are versioned, reviewed, traceable, and reproducible.

WHAT WE NEED YOU TO BRING TO THE TABLE

A strong sense of ownership for production health, proactively driving improvements in stability, performance, and resilience.
The ability to establish and evolve SLO‑driven reliability practices in an organization that is building this muscle.
Strong observability instincts with a focus on signal over noise, turning metrics, logs, and traces into actionable insight through clean dashboards, meaningful alerts, and well‑defined SLOs instead of alert fatigue.
Hands‑on experience with the Grafana stack, including Prometheus, Grafana Alloy, Loki, and Tempo, with practical knowledge of pipeline design, scaling considerations, and maintaining high signal quality.
Experience designing or significantly improving monitoring and observability architectures across collection, storage, retention, cardinality control, tagging strategy, and cost awareness, while keeping the observability stack itself reliable.
Solid understanding of Kubernetes workloads, networking, scaling patterns, and failure modes, with real‑world experience operating systems in Google Cloud environments.
Understanding of the OpenTelemetry protocol and its role in modern observability architectures.
A proactive mindset. You bring solutions, clearly articulate design options and trade‑offs, and drive initiatives through to…

