Software Development Engineer (Elastic Kubernetes Service), EKS Scalability & Performance
Listed on 2026-05-28
Software Development
Software Engineer, Cloud Engineer - Software, DevOps
Description
We are looking for a Software Development Engineer to join the EKS KCP Scalability team and work on some of the hardest distributed systems problems. You will design, build, and operate systems that directly determine whether EKS customers, from startups to the largest AI/ML workloads on the planet, experience a reliable, performant control plane.
This is not a role where you implement features in isolation. You will work across the full stack: from the Kubernetes API server process and upstream community engagement, through autoscaling services that right-size control planes in real time, to the SLA measurement pipelines that hold us accountable to our customers. You will own systems end-to-end — from design through production operations — and your work will be measured by customer outcomes, not lines of code.
Key job responsibilities
You will build and operate the Vertical Auto-Scaling Service (VAS) and its next-generation successor (VAS 2.0), which dynamically right-sizes EKS control planes by evaluating CPU/memory utilization, etcd throttle rates, node-count thresholds, and network utilization simultaneously. You will work on the SLA measurement pipeline (Minutely SLA → Daily SLA → Monthly SLA) that enforces EKS's uptime commitments, investigating breaching clusters weekly and building automation to detect and mitigate degradation before customers notice.
You will contribute to the control plane architecture for EKS Ultra clusters, defining how the API server, etcd, and associated components scale to support 100,000-node clusters running generative AI workloads. You will maintain and extend version release qualification scale tests that gate every new Kubernetes version before it reaches customers. You will engage with the upstream Kubernetes community — driving KEPs that work backwards from EKS customer requirements around performance, scale, and resiliency.
Depending on your interests and the team's priorities, you may also work on workload identity systems (IRSA, EKS Pod Identity), Cluster Access Management, EC2 capacity management and grey failure detection, or Large-Scale Event response and weight shifting.
About The Team
The EKS KCP Scalability organization owns the performance, availability, and autoscaling of the Kubernetes control plane powering Amazon EKS — from small development clusters to 100,000-node Ultra clusters running generative AI workloads. We ensure every EKS cluster operates within its contracted SLA and delivers predictable, high-performance behavior at any scale.
Our charter spans three domains: Performance, Availability + Autoscaling, and Auth. We operate at the intersection of distributed systems, Kubernetes internals, and AWS infrastructure, building systems that scale to hundreds of thousands of clusters globally.
Basic Qualifications
- 3+ years of non-internship professional software development experience
- 2+ years of non-internship design or architecture (design patterns, reliability and scaling) of new and existing systems experience
- 1+ years of software development engineer or related occupational experience
- 1+ years of experience designing and developing large-scale, multi-tiered, multi-threaded, embedded or distributed software applications, tools, systems, and services using C#, C++, Java, or Perl
- 1+ years of Object Oriented Design experience
- Bachelor's degree or foreign equivalent in Computer Science, Engineering, Mathematics, or a related field
- Experience programming with at least one software programming language
Preferred Qualifications
- 3+ years of full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations experience
- Bachelor's degree in computer science or equivalent