Site Reliability Engineer — Info Apps
Listed on 2026-02-21
IT/Tech
Systems Engineer, SRE/Site Reliability, Cloud Computing
Austin, Texas, United States
Software and Services
Do you love building and scaling infrastructure that delights millions of customers? At Apple, we believe reliability is a feature. We are looking for a Site Reliability Engineer to join our team in overseeing the performance and availability of the core backend services for our News, Stocks, Weather, Books, and Creator Studio applications.
As an SRE, you won’t just be responding to alerts; you will shape the evolution of our observability strategy, mentor others in incident management, and champion automation. You will help us refine our "Golden Signals" and ensure our Kubernetes-based ecosystem remains world-class.
In this role, you will be a key pillar of our engineering organization, ensuring that our services remain highly available and performant. Your impact will include:
System Architecture:
Designing and implementing the next generation of our telemetry and alerting systems.
Reliability Engineering:
Defining SLOs/SLIs and ensuring our monitoring strategy captures the true health of the user experience.
Operational Excellence:
Reducing operational load through software; if you have to do it twice, you’ll want to automate it.
Collaboration:
Partnering with App Dev teams to influence the "design for reliability" phase of the software development lifecycle.
Mentorship:
Acting as a technical lead for junior members and offshore partners, providing guidance on runbook development and disaster recovery.
- Experience:
5+ years in SRE, DevOps, or Infrastructure roles with a proven track record of managing high-traffic, internet-facing production environments.
- Kubernetes Expertise:
Deep experience building and operating container orchestration systems (EKS/GKE/Vanilla K8s). You should be comfortable troubleshooting from the networking layer up to the application pod.
- Observability Champion:
Expert knowledge of the four Golden Signals (Latency, Traffic, Errors, and Saturation). Proficiency with tools like Prometheus, Grafana, and Splunk is essential.
- Cloud Proficiency:
Hands-on experience designing and maintaining resilient infrastructure on public cloud providers (AWS, GCP, or Azure).
- Scripting & Automation:
Strong ability to code at a scripting level (Python or Go preferred) to automate toil and build self-healing systems.
- Incident Leadership:
Experience leading incident response, performing Root Cause Analysis (RCA), and running blameless post-mortems to improve system resilience.
- Infrastructure as Code:
Proficient in Terraform, CloudFormation, or Pulumi to manage immutable infrastructure.
- Search & Data:
Specialized experience operating and tuning Solr or Elasticsearch at scale.
- Networking:
Strong understanding of TCP/IP, Load Balancing (ELB/ALB), and Service Mesh (Istio/Linkerd).
- Data Systems:
Experience with Kafka, Cassandra, or Postgres in a distributed environment.
Apple is an equal opportunity employer that is committed to inclusion and diversity. We seek to promote equal opportunity for all applicants without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, Veteran status, or other legally protected characteristics. Learn more about your EEO rights as an applicant.
Apple accepts applications to this posting on an ongoing basis.