Evaluation Reliability SRE Job, Cupertino area, California USA, IT/Tech

**Weekly Hours:** 40

**Role Number:**

**Summary**
Siri's quality signal drives every model and product decision before a release ships. But a signal is only as trustworthy as the infrastructure behind it.

The Evaluation Reliability Engineering (ERE) team exists to make that infrastructure bulletproof. Within ERE, Core SRE owns the production backbone: resource management, session orchestration, on-call response, and the observability systems that surface failures before they corrupt evaluation signal. We sit at the intersection of distributed systems, ML evaluation infrastructure, and operational excellence.

**Description**
This is a senior hands-on role. You share primary on-call as part of a global follow-the-sun rotation, lead incident investigations end-to-end, and set the operational bar the rest of the team works against. You are fluent with agentic coding tools like Claude Code, Cursor, or Copilot, and use them as a force multiplier across runbook authoring, automation, and log analysis.

**Minimum Qualifications**
+ 5+ years of site reliability, infrastructure, or platform engineering experience with direct on-call ownership in production systems

+ Hands-on orchestration experience (Kubernetes or equivalent): cluster health, resource management, scheduling, and failure diagnosis at scale

**Preferred Qualifications**
+ Experience owning or closely operating a device or VM provisioning pipeline; familiarity with virtualization-layer failure modes is a strong plus

+ Track record of improving system reliability against measurable outcomes such as uptime, MTTR, and incident frequency: eliminating the causes of incidents, not just responding to them

+ Incident command discipline: able to lead a multi-team incident from declaration to close-out

+ Depth in at least one of: distributed systems reliability, device management infrastructure, evaluation or ML platform operations

+ Demonstrated cross-team technical influence; prior experience shaping reliability practices beyond the immediate team