Engineering Manager, Cloud Network Reliability
Listed on 2026-02-16
IT/Tech
Systems Engineer, SRE/Site Reliability
Sunnyvale, California, United States
Software and Services
The Apple Cloud Networking team builds and operates large-scale, software-defined networking platforms that enable secure, resilient, and highly available multi-cloud connectivity with a global footprint. Our infrastructure powers critical Apple services, including iCloud, iTunes, Siri, and Maps.
We are seeking an experienced and visionary Reliability Engineering Manager to lead and grow a team of engineers focused on ensuring the availability, performance, scalability, and resiliency of Apple’s global network services. In this role, you will work closely with software engineering, infrastructure, and operations teams across Apple to deliver reliable, fault-tolerant systems that operate at massive scale.
As a key leader within the Cloud Networking organization, you will define and drive the reliability and resiliency strategy for Apple’s network platform services. You will be responsible for building, scaling, and mentoring a high-performing Production Engineering team that champions SRE and SWE best practices, release engineering, and data-driven decision-making. You will establish strong cross-functional partnerships to ensure reliability and resiliency are embedded throughout the system lifecycle, from design and development to deployment and operations.
Your leadership will help ensure Apple’s network services meet demanding availability, latency, resilience, and security requirements while continuously improving operational maturity. We are looking for a leader who is deeply passionate about operating mission-critical, globally distributed systems, preventing outages, learning from failures, and driving long-term reliability improvements.
- Define and execute the reliability engineering vision, strategy, and roadmap for Apple’s cloud networking platforms.
- Lead day-to-day execution of reliability initiatives, including sprint planning, prioritization, and retrospectives with a strong focus on operational outcomes.
- Establish and own SLIs, SLOs, and error budgets, using metrics and observability to guide engineering trade-offs and reliability investments.
- Partner closely with software engineering, infrastructure, and security teams to design and operate fault-tolerant, resilient systems.
- Promote automation and operational efficiency through tooling, testing, deployment pipelines, and self-healing systems.
- Mentor and develop engineers through regular one-on-ones, career planning, and performance feedback, fostering a culture of ownership and continuous improvement.
- Collaborate with recruiting to attract and hire top reliability engineering talent.
- Advocate for reliability, resilience, and operational excellence across multiple product and platform teams.
- 10+ years of experience in software engineering, systems engineering, or infrastructure engineering.
- 6+ years of experience in a technical leadership role with people management responsibilities.
- Strong background in designing, operating, and supporting highly available, fault-tolerant distributed systems at scale.
- Hands-on experience with reliability engineering, SRE, or large-scale production operations.
- Solid understanding of network infrastructure and software-defined networking (SDN).
- Ability to lead cross-functional collaboration and influence technical decisions across teams.
- Experience in defining and operating SLO-based reliability and resiliency programs.
- Strong knowledge of observability systems (metrics, logging, tracing) and qualification engineering.
- Experience with microservices architectures, RESTful APIs, and cloud-native platforms.
- In-depth understanding of networking protocols, routing mechanisms, and traffic management.
- Broad knowledge of networking solutions across OSI layers 3 through 7.
- Excellent written and verbal communication skills with the ability to clearly articulate risk, reliability trade-offs, and operational priorities.
- Proven ability to manage competing priorities, drive initiatives to completion, and deliver results in fast-paced environments.
At Apple, base pay is one part of our total…