Sr. Software Engineer - AI/ML, AWS Neuron Distributed Training
Listed on 2026-02-03
Software Development
Machine Learning/ML Engineer, AI Engineer, Data Scientist
Overview
Annapurna Labs designs silicon and software that accelerates innovation. Our custom chips, accelerators, and software stacks enable us to tackle unprecedented technical challenges and deliver solutions that help customers change the world. AWS Neuron is the complete software stack powering AWS Trainium (Trn2/Trn3), and we are seeking a Senior Software Engineer to join our ML Distributed Training team.
Responsibilities
- Design, implement, and optimize distributed training solutions for large-scale ML models running on Trainium instances. A significant part of your work will involve extending and optimizing popular distributed training frameworks, including FSDP (Fully Sharded Data Parallel), TorchTitan, and Hugging Face libraries, for the Neuron ecosystem.
- Develop and optimize mixed-precision and low-precision training techniques using BF16, FP8, and emerging numerical formats to maximize training throughput while maintaining model accuracy and convergence quality. Implement precision-aware training strategies, loss scaling techniques, and careful gradient management to ensure training stability across reduced precision formats.
- Profile, analyze, and tune end-to-end training pipelines to achieve optimal performance on Trainium hardware. Partner with hardware, compiler, and runtime teams to influence system design and unlock new capabilities. Work directly with AWS solution architects and customers to deploy and optimize training workloads at scale.
Basic Qualifications
- Bachelor's degree in computer science or equivalent
- 5+ years of non-internship professional software development experience
- 5+ years of programming with at least one software programming language
- 5+ years of leading design or architecture (design patterns, reliability and scaling) of new and existing systems
- 5+ years of full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations experience
- Experience as a mentor, tech lead or leading an engineering team
- Experience in machine learning and large-scale training of LLMs, and expertise in PyTorch
Preferred Qualifications
- Master's degree in computer science or equivalent
- Experience in computer architecture
- Previous software engineering expertise with PyTorch/JAX/TensorFlow, distributed libraries and frameworks, and end-to-end model training
Amazon is an equal opportunity employer and does not discriminate on the basis of protected veteran status, disability, or other legally protected status.
Our inclusive culture empowers Amazonians to deliver the best results for our customers. If you have a disability and need a workplace accommodation or adjustment during the application and hiring process, including support for the interview or onboarding process, please visit (Use the "Apply for this Job" box below) for more information. If the country/region you’re applying in isn’t listed, please contact your Recruiting Partner.
The base salary range for this position is listed below. Your Amazon package will include sign-on payments and restricted stock units (RSUs). Final compensation will be determined based on factors including experience, qualifications, and location. Amazon also offers comprehensive benefits including health insurance (medical, dental, vision, prescription, Basic Life & AD&D insurance and option for Supplemental life plans, EAP, Mental Health Support, Medical Advice Line, Flexible Spending Accounts, Adoption and Surrogacy Reimbursement coverage), 401(k) matching, paid time off, and parental leave.
Learn more about our benefits at .
USA, CA, Cupertino - - USD annually
Job : A3168219