Senior Software Engineer, Machine Learning Platform Technologies – Traffic Infrastructure
Listed on 2026-01-06
Cupertino, California, United States
Machine Learning and AI
Are you an expert in large-scale networking and traffic infrastructure with a passion for building next-generation platforms for machine learning systems? We’re seeking a hands-on technical leader with deep expertise in Envoy, Istio service mesh, L4/L7 load balancing, and modern internet protocols (HTTP/2, gRPC, HTTP/3) to design and scale traffic platforms that power Apple’s Search and ML ecosystems. If you’ve contributed to CNCF or networking projects such as Envoy, Istio, Kubernetes networking, or related data-plane technologies, and you’re excited about building capacity-aware, metrics-driven traffic systems for ML inference and training, this role offers the opportunity to architect at Apple scale—delivering highly performant, resilient, and intelligent traffic infrastructure supporting billions of requests.
Description
The MLPT Traffic Infrastructure Team within Apple’s Services organization builds the foundational networking and traffic management platforms that power Search and large-scale ML workloads. Our focus is on designing modern L4/L7 traffic systems that intelligently route, balance, and optimize requests across heterogeneous compute environments, including GPU-backed inference services and multi-cloud deployments. We are reimagining traffic infrastructure as a programmable, metrics-driven, and capacity-aware platform, leveraging Envoy-based data planes, Istio service mesh, and dynamic control planes to support low-latency, high-throughput ML workloads.
You’ll work closely with ML engineers, SREs, and platform teams to enable secure, observable, and adaptive request routing for both server-to-server and client-to-server use cases.
- Architect and build L4/L7 traffic platforms for ML training and inference using Envoy, Istio, and modern load-balancing techniques.
- Design and implement dynamic, capacity-aware, and metrics-driven load balancing strategies for HTTP, gRPC, and streaming ML inference workloads.
- Develop and optimize service mesh architectures for high-throughput, low-latency ML systems, including multi-cluster and multi-region topologies.
- Lead the evolution of client-to-server and server-to-server traffic patterns, including adoption of HTTP/3 where appropriate.
- Collaborate with ML and platform teams to support scalable inference, A/B traffic shifting, canarying, and adaptive routing strategies.
- Contribute to and upstream improvements in Envoy, Istio, Kubernetes networking, or related CNCF projects, representing Apple in the open-source community.
- Implement observability, telemetry, and debugging frameworks for traffic flows (latency, tail behavior, retries, back pressure, saturation).
- Ensure traffic platforms are secure, resilient, and cost-efficient, supporting hybrid and multi-cloud environments at global scale.
- Mentor engineers and drive architectural decisions across networking and traffic-infra domains.
Minimum Qualifications
- BS/MS in Computer Science or equivalent practical experience.
- 5+ years of experience in distributed systems, networking, or traffic infrastructure engineering.
- Strong programming experience in Golang and Python, especially for control-plane or data-plane systems.
- Deep expertise in L4/L7 networking concepts, including load balancing, connection management, retries, timeouts, and congestion control.
- Hands-on experience with Envoy, Istio, or similar service mesh / proxy technologies.
- Strong understanding of HTTP/1.1, HTTP/2, gRPC, and modern transport protocols.
- Experience designing and operating high-throughput, low-latency systems in production.
- Proven ability to lead complex technical initiatives and mentor engineers.
Preferred Qualifications
- 9+ years in networking, traffic infrastructure, or large-scale distributed systems roles.
- Contributions to CNCF or networking open-source projects (Envoy, Istio, Kubernetes networking, eBPF, etc.).
- Experience with HTTP/3, QUIC, or next-generation transport protocols.
- Strong understanding of capacity-based routing, adaptive load balancing, and feedback-driven…