Software Development Engineer, EC2 Instance Networking
Job in Santa Clara, Santa Clara County, California, 95053, USA
Listed on 2026-05-10
Listing for: Amazon Web Services (AWS)
Full Time position
Job specializations:
- Software Development: Software Engineer, DevOps, Cloud Engineer - Software
Job Description & How to Apply Below
Description
Join our team building the scale‑out networking backbone that powers the world's largest AI training clusters. We develop high‑performance RDMA and RoCE solutions that enable distributed training of trillion‑parameter models across thousands of compute nodes on AWS infrastructure.
Key Responsibilities
- Design and develop high‑performance networking software solutions utilizing RDMA and RoCE technologies for large‑scale AI clusters
- Integrate SmartNIC acceleration hardware with EC2 control plane systems and APIs
- Implement and optimize collective communication patterns for distributed AI training workloads
- Develop comprehensive performance monitoring, metrics collection, and benchmarking tools for high‑bandwidth cluster interconnects
- Create automated testing frameworks and stress‑testing tools for multi‑rack distributed systems
- Debug complex system‑level issues across hardware acceleration, kernel networking, and distributed applications
- Collaborate on architecture decisions for next‑generation scale‑out AI infrastructure
- Participate in design reviews, code reviews, and technical documentation
Qualifications
- 3+ years of non‑internship professional software development experience
- 2+ years of non‑internship design or architecture experience for new and existing systems
- Strong programming skills in C/C++ with a focus on high‑performance systems
- Experience with RDMA technologies and RoCE implementations
- Familiarity with collective communication libraries (NCCL, RCCL, OneCCL, MPI)
- Experience with Linux networking, kernel development, and distributed systems
- Understanding of high‑performance computing clusters and parallel programming
- 3+ years of full software development life cycle experience, including coding standards, code reviews, source control management, build processes, testing, and operations
- Bachelor's degree in computer science or equivalent
- Experience with SmartNIC programming and network acceleration hardware APIs
- Knowledge of large‑scale AI training infrastructure and multi‑rack cluster networking
- Experience with performance optimization, benchmarking, and system‑level debugging
- Understanding of AI accelerator architectures and scale‑out communication patterns
- Experience with cloud infrastructure integration and virtualization technologies
- Strong problem‑solving skills and experience with complex distributed systems
- Proficiency in design and analysis of algorithms and data structures
- Linux operating system knowledge
- In‑depth knowledge of TCP/IP
- Kernel or embedded development, particularly Linux kernel
- Strong knowledge of Computer Science fundamentals in data structures, algorithm design, problem solving, and complexity analysis
- Knowledge of at least one modern programming language such as C, C++, Rust, Python, or Perl
- Experience developing complex software systems that have been successfully delivered to customers
- Knowledge of professional software engineering practices and best practices for the full software development life cycle
- Ability to take a project from scoping requirements through actual launch of the project
- Experience communicating with users, other technical teams, and management to collect requirements and to describe software product features and technical designs
- Experience mentoring junior software development engineers and driving engineering excellence
Amazon is an equal opportunity employer and does not discriminate on the basis of protected veteran status, disability, or other legally protected status.