Senior Site Reliability Engineer Job Zurich area,Zürich Kanton Zürich Switzerland,IT/Tech

Location: Zürich

Senior Site Reliability Engineer

Apply for the Senior Site Reliability Engineer role at Caffeine.

At Caffeine, we are building the world’s first platform to create full‑stack, on‑chain applications through natural language. Our mission is to make building software as simple as a conversation, transforming ideas into live applications instantly. We are a cross‑functional team of engineers and researchers building the AI that powers this new paradigm. To do this, we need world‑class product engineers who can design beautiful, reliable, and performant experiences across the stack.

About the Role

As a Senior Site Reliability Engineer, you will be the guardian of the Caffeine.ai user experience. You are not just keeping servers online; you are ensuring the end‑to‑end reliability of the core “idea‑to‑application” journey. Your focus will be on the availability, reliability, and scalability of our user‑facing products and the complex AI‑driven micro‑services that power them. You will be deeply embedded with our product and engineering teams, acting as the critical bridge between our ambitious AI vision and a rock‑solid production reality.

This is a hands‑on role for an engineer who thinks about reliability from the user’s perspective, wants to provide the best developer experience for fellow engineers, and wants to solve novel challenges in a rapidly evolving AI/ML environment.

What You’ll Do

Own Product Reliability: Take ownership of the availability and reliability of the Caffeine.ai platform. You'll define our Service Level Objectives (SLOs), provide a reliable Continuous Delivery (CD) platform and work across teams to meet and exceed them.
Build Deep Product Insight: Design, implement, and manage our observability stack (Datadog, OpenTelemetry, distributed tracing, logs, metrics) to provide high‑fidelity signals into the health of our services and, most importantly, the user experience.
Engineer Scalable Solutions: Dive deep into our architecture to identify and eliminate performance bottlenecks, single points of failure, and sources of toil. You'll write code, primarily in Rust, Go and TypeScript (we use Pulumi), to automate operations and build robust, self‑healing systems. You will set up routing and service mesh configurations (e.g. Istio).
Champion Reliability from Day One: Partner with software engineers during design and code reviews to proactively bake in reliability, scalability, and operability. You will be the expert voice that helps the team build for production from the start.
Lead and Learn from Incidents: Coordinate the incident response process for our production services. You'll lead blameless post‑mortems that drive meaningful improvements across our systems and processes.
Participate in an On‑Call Rotation: As a key member of the team, you will be part of a compensated on‑call rotation focused on coordinating incident response and ensuring platform stability.

Who You Are

You are a product‑minded engineer with proven experience as a Site Reliability Engineer, with a strong focus on user‑facing applications and distributed service architectures.
You have deep expertise in building and running modern observability stacks (e.g., Datadog, OpenTelemetry) and believe in data‑driven decision‑making.
You are a proficient software developer. You have experience designing and writing production‑grade applications and automation, ideally in a systems and infrastructure language like Rust or Go, and are open to using Python, TypeScript or Bash.
You are a methodical troubleshooter, capable of systematically diagnosing complex issues across the entire stack, from networking protocols (TCP/IP, DNS, TLS) up to the application layer.
You understand the complexities of modern CI/CD pipelines and have experience building and maintaining them.
You thrive in a collaborative environment and possess excellent communication skills, capable of explaining complex technical concepts to a diverse audience.

Bonus

You have experience with the reliability and performance challenges of AI/ML‑powered systems or large‑scale data processing pipelines.

This is a hybrid role based in our Zurich office, with a requirement of 3+ days in the office per week.

Seniority Level

Mid‑Senior level

Employment Type

Full‑time

Industries

Software Development and Technology, Information and Internet

