Site Reliability Engineer, Enterprise Technology Services Job, Sunnyvale area, California, USA, IT/Tech

Site Reliability Engineer, Enterprise Technology Services

At Apple, groundbreaking ideas quickly transform into extraordinary products and services that delight millions worldwide. If you're passionate about engineering and operating robust, large-scale systems, imagine the impact you could make. The Identity Management Services (IdMS) SRE team is seeking a Site Reliability Engineer (SRE) to design, build tools for, and support our critical platform services. We're looking for someone with strong software development skills, deep systems expertise, and a solid understanding of SRE principles, ready to ensure operational precision at Apple's immense scale.

Your work will be pivotal in powering services across Apple, partnering with engineering teams to deliver seamless experiences.

This role involves managing one of the largest Identity Management Platform services for a vast customer base across various devices and services. Key responsibilities include overseeing critical services such as device provisioning, authentication, token management, and security. A primary objective is ensuring the high availability and reliability of the system to facilitate critical authentication and authorization transactions, user provisioning, purchases, subscriptions, and account lifecycle management (creation, management, and recovery).

This also entails maintaining platform security by blocking and rate-limiting fraud traffic at the perimeter, and ensuring high data consistency and replication across multiple data centers through custom mechanisms. The role covers managing infrastructure, capacity planning, disaster recovery, and auto-failover mechanisms. It also involves monitoring infrastructure and application services, driving incident management for internal and external stakeholders, and defining system and functional observability.

Furthermore, this position helps teams overcome system bottlenecks and architectural challenges for efficiency improvements, ensures systems are compliant with industry standards and pass critical audits, and drives automation solutions for large-scale platform service needs. Advanced responsibilities include alert engineering, anomaly detection with Machine Learning tools, and adapting to Generative AI enhancements. Investigating device-related issues by debugging relevant logs is also part of the role, alongside managing the full system lifecycle, including configuration and code deployment in user acceptance test and production environments.

The responsibilities include:

Drive Platform Reliability & SRE Standards:
Lead the optimization of a large-scale Identity Management Platform, ensuring ultra-high availability, reliability, and performance for critical authentication, authorization, and provisioning services. Define and implement robust Service Level Indicators (SLIs), Objectives (SLOs), and Agreements (SLAs) to guide engineering teams toward ambitious reliability and observability goals.
Architect & Engineer Resilient Systems:
Design, build, and manage robust, distributed systems across cloud and on-premise infrastructure. Develop advanced capacity planning, disaster recovery, auto-failover, and data consistency mechanisms. Innovate by creating reusable tooling, automation frameworks, and advanced reliability platforms covering observability, alerting, chaos testing, auto-scaling, and failover strategies.
Lead Operational Excellence & Incident Management:
Drive comprehensive operational excellence through advanced observability (tracing, logging, metrics, alerting) and next-generation telemetry, leveraging Machine Learning for anomaly detection and exploring GenAI for alert engineering. Lead technical response during major incidents, conducting deep post-mortems, driving systemic improvements, and embedding preventive architectural controls.
Champion Automation & Resilience Engineering:
Develop and implement large-scale automation solutions, internal tooling, and frameworks to enhance reliability, cost-efficiency, and operational visibility. Advance resilience engineering by integrating automation pipelines, CI/CD, canary releases, and chaos engineering principles into core development and deployment workflows. Drive initiatives to eliminate toil and contribute to multi-cloud strategy.
Ensure Security & Compliance:
Maintain the highest security posture, implementing fraud prevention at the perimeter, and ensuring strict adherence to industry compliance standards (e.g., ISO-27001, PCI). Uphold all architectural and operational practices to rigorously meet security standards, compliance requirements, and audit readiness protocols.
Foster Cross-Functional Collaboration:
Partner extensively with engineering, production support, and QA teams to ensure seamless service delivery. Promote a strong DevOps culture and provide technical insights through log analysis and system debugging.

Minimum qualifications include:

5+ years of experience in Site Reliability Engineering with a strong focus on…