Principal PMT-ES - AI/ML Training, Annapurna Labs

Job in Cupertino, Santa Clara County, California, 95014, USA
Listing for: Amazon Web Services (AWS)
Apprenticeship/Internship position
Listed on 2026-04-22
Job specializations:
  • IT/Tech
    AI Engineer (Applied/Software), Machine Learning/ ML Engineer, IT Support
Salary/Wage Range or Industry Benchmark: 80000 - 100000 USD Yearly
Job Description & How to Apply Below

Overview

AWS Trainium is deployed at scale, with millions of chips in production, used for training and inference of frontier models. AWS Neuron is the software stack for Trainium, enabling customers to run deep learning and generative AI workloads with optimal performance and cost efficiency. AWS Neuron is hiring a Principal Technical Product Manager to define and drive product strategy for training software on Trainium.

This includes distributed training libraries, post-training workflows (RLHF, DPO, fine-tuning), reinforcement learning frameworks, and training performance optimization. Your mission is to enable researchers and operators to train frontier models at scale on Trainium, from single-node experimentation to distributed training across thousands of nodes.

Responsibilities
  • Define and execute training product strategy and roadmap, working backwards from customer requirements in collaboration with engineering leadership. Define the vision for how customers train frontier models at scale on Trainium, balancing performance, developer experience, and ecosystem compatibility. Produce PRFAQs and PRDs for training capabilities. Drive technical alignment across Neuron training libraries, distributed training infrastructure, and dependencies. Partner with PMs responsible for compiler, NKI, runtime, and infrastructure.

    Drive trade-offs between training performance, scalability, developer experience, and ecosystem compatibility. Define requirements for reusable training building blocks that compose into end-to-end workflows.
  • Drive strategy for post-training workflows including RLHF, DPO, reward modeling, and fine-tuning. Define requirements for how Neuron supports emerging training paradigms, model architectures, and RL-based optimization loops. Lead the product experience for RL research-to-production workflows on Trainium. Create and optimize RL libraries and frameworks to help researchers and production model builders.
  • Engage with customers and internal teams (BD, Solutions Architecture, GTM) to understand distributed training challenges, RL needs, performance optimization requirements, and framework preferences. Translate customer pain points into product requirements. Define success metrics for training adoption and performance. Support customer enablement for training migration and optimization.
  • Define how Neuron supports the AI/ML training ecosystem and the tools customers will use for their training workflows on Trainium. Own the technical depth on training-specific ecosystem tools and integrate Neuron's training libraries with them. Track ecosystem trends and feed insights into product planning. Drive open source community engagement and upstream contributions for training-related tools. Coordinate with BD on partnership discussions where training-specific technical input is needed.
  • Lead end-to-end launches for training capabilities, coordinating documentation, field enablement, and customer communications. Partner with Marketing and Solutions Architecture to drive awareness and adoption. Define launch success criteria and track adoption metrics.
Qualifications

Basic Qualifications
  • 7+ years of experience as a Technical Product Manager
  • Bachelor's degree in computer science, engineering, analytics, mathematics, statistics, IT or equivalent
  • Experience with large-scale model training workflows and distributed training concepts
  • Familiarity with major AI/ML training frameworks (JAX or PyTorch) and how training libraries interact with them
  • Experience driving product strategy, long-term roadmap development, and cross-organizational alignment
  • Excellent written and verbal communication abilities, including executive-level communication
Preferred Qualifications
  • Experience with PyTorch or JAX distributed training
  • Track record of driving developer training libraries and tools
  • Experience with design and scaling of training optimization software (e.g., NeMo, TorchTitan, TRL, VeRL, MaxText, AXLearn, or similar)
  • Experience leading RL for research-to-production at scale
  • Experience with post-training workflows including RLHF, DPO, reward modeling, and fine-tuning
  • Experience with AI/ML training accelerators and hardware, including training performance optimization, profiling, and tooling
  • Experience with distributed training of large-scale models including model parallel training techniques (tensor, pipeline, sequence, and expert parallelism)
  • Experience working on open source and GitHub-first developer products with deep customer interactions
  • Track record of driving open standards and ecosystem integration for training workflows
  • Experience operating in early-stage, ambiguous environments with startup-like velocity

Amazon is an equal opportunity employer and does not discriminate on the basis of protected veteran status, disability, or other legally protected status.

Los Angeles County applicants: job duties include working safely, communicating effectively, and following all laws and company policies. Criminal history may have a direct relation…
