Evaluation Reliability SRE Job, Cupertino area, California, USA, IT/Tech

Siri's quality signal drives every model and product decision before a release ships. But a signal is only as trustworthy as the infrastructure behind it. The Evaluation Reliability Engineering (ERE) team exists to make that infrastructure bulletproof. Within ERE, Core SRE owns the production backbone: resource management, session orchestration, on-call response, and the observability systems that surface failures before they corrupt evaluation signal.

We sit at the intersection of distributed systems, ML evaluation infrastructure, and operational excellence.

This is a senior hands-on role. You share primary on-call as part of a global follow-the-sun rotation, lead incident investigations end-to-end, and set the operational bar the rest of the team works against. You are fluent with agentic coding tools like Claude Code, Cursor, or Copilot, and use them as a force multiplier across runbook authoring, automation, and log analysis.

Experience owning or closely operating a device or VM provisioning pipeline; familiarity with virtualization-layer failure modes is a strong plus

Track record of improving system reliability against measurable outcomes (uptime, MTTR, incident frequency): not just responding to incidents but eliminating their causes

Incident command discipline: able to lead a multi-team incident from declaration to close-out

Depth in at least one of: distributed systems reliability, device management infrastructure, evaluation or ML platform operations

Demonstrated cross-team technical influence; prior experience shaping reliability practices beyond the immediate team

5+ years of site reliability, infrastructure, or platform engineering experience with direct on-call ownership in production systems

Hands-on orchestration experience (Kubernetes or equivalent): cluster health, resource management, scheduling, and failure diagnosis at scale