Site Reliability Engineer, Enterprise Technology Services
Job in Sunnyvale, Santa Clara County, California, 94085, USA
Listed on 2026-06-01
Listing for: Apple Inc.
Full Time position
Job specializations:
- IT/Tech: Systems Engineer, Cloud Computing
Job Description & How to Apply Below
Your work will be pivotal in powering services across Apple, partnering with engineering teams to deliver seamless experiences.
This role involves managing one of the largest Identity Management Platform services for a vast customer base across various devices and services. Key responsibilities include overseeing critical services such as device provisioning, authentication, token management, and security. A primary objective is ensuring the high availability and reliability of the system to facilitate critical authentication and authorization transactions, user provisioning, purchases, subscriptions, and account lifecycle management (creation, management, and recovery).
This also entails maintaining platform security by blocking and rate-limiting fraud traffic at the perimeter, and ensuring high data consistency and replication across multiple data centers through custom mechanisms. The role covers managing infrastructure, capacity planning, disaster recovery, and auto-failover mechanisms. It also involves monitoring infrastructure and application services, driving incident management for internal and external stakeholders, and defining system and functional observability.
Furthermore, this position helps teams overcome system bottlenecks and architectural challenges for efficiency improvements, ensures systems are compliant with industry standards and pass critical audits, and drives automation solutions for large-scale platform service needs. Advanced responsibilities include alert engineering, anomaly detection with Machine Learning tools, and adapting to Generative AI enhancements. Investigating device-related issues by debugging relevant logs is also part of the role, alongside managing the full system lifecycle, including configuration and code deployment in user acceptance test and production environments.
Observability & SRE Principles:
Experience with monitoring and logging tools (e.g., Prometheus, Splunk, Grafana, OpenTelemetry) and a strong understanding of SRE principles, including observability, error budgeting, and service reliability metrics (SLA, SLO, SLI).
CI/CD & Automation:
Proficiency with CI/CD, Release Engineering, DevOps practices, and source control (Git). Experience designing and implementing CI/CD pipelines and Infrastructure as Code (Helm, CRD).
Programming & Data Systems:
Strong programming skills in languages like Java, Python, Go, etc. Experience with various databases (Relational, NoSQL, OLAP) and event-driven architectures (Kafka, RabbitMQ).
Reliability & Operations:
Experience with on-call, including incident/problem management (PIR, RCA) and a strong sense of ownership for system reliability.
Security & Compliance:
Understanding of security standards, policies, cryptography, and authentication (OAuth, SAML, SSO). Knowledge of Governance and Compliance.
Innovation & Collaboration:
Passion for designing reliable systems, advocating for automation, and a desire to collaborate effectively. Experience leveraging ML/GenAI for operational efficiency is a plus.
Certification:
Cybersecurity certification will be an added advantage.
Education:
Bachelor's or Master's degree in Computer Science or equivalent practical experience.
5+ years of experience in Site Reliability Engineering, with a strong focus on Java and on building, scaling, and operating large-scale distributed platform services. BS degree in computer science or an equivalent field with 7+ years of experience, or MS degree in computer science or an equivalent field with 5+ years of experience. Strong technical grasp of, and experience working with, open-source technologies designed for large-scale data processing.
Experience designing, analyzing, and troubleshooting distributed systems. Proficiency in at least one programming or scripting language (Python, Java, Go, Bash, Ansible, or similar). Experience designing observability stacks (Prometheus, Grafana, Datadog, OpenTelemetry, ELK, etc.). Excellent troubleshooting and problem-solving skills.