Principal Engineer - AI/ML Forward Deployment Engineering Job, Santa Clara area, California, USA, IT/Tech

WHAT YOU DO AT AMD CHANGES EVERYTHING

At AMD, our mission is to build great products that accelerate next‑generation computing experiences—from AI and data centers, to PCs, gaming and embedded systems. Grounded in a culture of innovation and collaboration, we believe real progress comes from bold ideas, human ingenuity and a shared passion to create something extraordinary. When you join AMD, you'll discover the real differentiator is our culture.

We push the limits of innovation to solve the world’s most important challenges—striving for execution excellence, while being direct, humble, collaborative, and inclusive of diverse perspectives. Join us as we shape the future of AI and beyond.
Together, we advance your career.

THE ROLE

The PMTS, DC GPU Advanced Forward Deployment and Systems Engineering is a leadership position designed to optimize the design, roll‑out, and post‑rollout management of AI/ML fabrics. The candidate will be the technical interface between customers and internal engineering groups, including field application engineers. Drawing on extensive experience in large network architecture, storage, AI/ML network deployments, and performance tuning, the role requires a disciplined approach to system triage, at‑scale debug, and infrastructure optimization to ensure robust performance and efficient transitions from GPU production qualification to at‑scale datacenter deployment.

THE PERSON

This position is for a PMTS, DC GPU Advanced Forward Deployment and Systems Engineering, with a focus on architecting, designing, and optimizing compute, network, and storage, and on benchmarking Machine Learning applications. You will be part of a team working closely with strategic customers and partners to enable large‑scale deployment of AMD CPU and GPU platforms. You will interface closely with ROCm software developers, DC GPU HW/FW/ASIC teams, Field Engineering teams, OEM/ODM partners, CSPs, and Marketing/Business Development teams.

Must be self‑motivated and possess the ability to work well within a team environment.

KEY RESPONSIBILITIES

Collaborate with strategic customers on scalable designs involving compute, networking, and storage environments, and work with industry partners and internal teams to accelerate the deployment and adoption of AI/ML models.
Engage in system‑level triage and at‑scale debug of complex issues across hardware, firmware, and software, ensuring rapid resolution and system reliability.
Drive the ramp of Instinct‑based large‑scale AI datacenter infrastructure based on NPI base platform hardware with ROCm, scaling up to pod and cluster level, leveraging the best in network architecture for AI/ML workloads.
Enhance tools and methodologies for large‑scale deployments to meet customer uptime goals and exceed performance expectations.
Engage with clients to deeply understand their technical needs, ensuring their satisfaction with tailored solutions that leverage your past experience in strategic customer engagements and architectural wins.
Provide domain-specific knowledge to other groups at AMD and share lessons learned to drive continuous improvement.
Engage with AMD product groups to drive resolution of application and customer issues.
Develop and present training materials to internal audiences, at customer venues, and at industry conferences.

PREFERRED EXPERIENCE

Expertise in networking and performance optimization for large‑scale AI/ML networks, including network, compute, and storage cluster design, modelling, analytics, performance tuning, convergence, and scalability improvements.
Solid, hands‑on expertise in at least one of three domains: compute, network, or storage.
Proven leadership in engaging customers with diverse technical disciplines in avenues such as Proof of Concept, Competitive evaluations, Early Field Trials, etc.
Direct experience working with large customers, with the ability to operate with a sense of urgency, own problems, and resolve them.
Demonstrated leadership in network architecture, with hands-on experience in RoCEv2 design, VXLAN‑EVPN, BGP, and lossless fabrics.
Proven ability to influence design and technology roadmaps, leveraging a deep understanding…

