Site Reliability Engineer - Info Apps
Job in Austin, Travis County, Texas, 78719, USA
Listing for: Apple Inc.
Full Time position
Listed on 2026-06-02
Job specializations:
- IT/Tech: Systems Engineer, Cloud Computing, SRE/Site Reliability
Job Description & How to Apply Below
You will shape the evolution of our observability strategy, serve as a mentor for incident management, and champion automation. You will help us refine our "Golden Signals" and ensure our Kubernetes-based ecosystem remains world-class.
In this role, you will be a key pillar of our engineering organization, ensuring that our services remain highly available and performant. Your impact will include:
System Architecture:
Designing and implementing the next generation of our telemetry and alerting systems.
Reliability Engineering:
Defining SLOs/SLIs and ensuring our monitoring strategy captures the true health of the user experience.
Operational Excellence:
Reducing operational load through software; if you have to do it twice, you'll want to automate it.
Collaboration:
Partnering with App Dev teams to influence the "design for reliability" phase of the software development lifecycle.
Mentorship:
Acting as a technical lead for junior members and offshore partners, providing guidance on runbook development and disaster recovery.
Search & Data:
Specialized experience operating and tuning Solr or Elasticsearch.
Networking:
Strong understanding of TCP/IP, Load Balancing (ELB/ALB), and Service Mesh (Istio/Linkerd).
Data Systems:
Experience with Kafka, Cassandra, or Postgres in a distributed environment.
Experience:
5+ years in SRE, DevOps, or Infrastructure roles with a proven track record of managing high-traffic, internet-facing production environments.
Kubernetes Expertise:
Deep experience building and operating container orchestration systems (EKS/GKE/Vanilla K8s). You should be comfortable troubleshooting from the networking layer up to the application pod.
Observability Champion:
Expert knowledge of the 4 Golden Signals (Latency, Traffic, Errors, and Saturation). Proficiency with tools like Prometheus, Grafana, and Splunk is essential.
Cloud Proficiency:
Hands-on experience designing and maintaining resilient infrastructure on public cloud providers (AWS, GCP, or Azure).
Scripting & Automation:
Strong ability to code at a scripting level (Python or Go preferred) to automate toil and build self-healing systems.
Incident Leadership:
Experience leading incident response, performing Root Cause Analysis (RCA), and implementing blameless post-mortems to improve system resilience.
Infrastructure as Code:
Proficient in Terraform, CloudFormation, or Pulumi to manage immutable infrastructure.