Technical Program Manager - AI Cluster Engineering
Listed on 2026-06-04
IT/Tech
Systems Engineer
WHAT YOU DO AT AMD CHANGES EVERYTHING
At AMD, our mission is to build great products that accelerate next-generation computing experiences—from AI and data centers, to PCs, gaming and embedded systems. Grounded in a culture of innovation and collaboration, we believe real progress comes from bold ideas, human ingenuity and a shared passion to create something extraordinary. When you join AMD, you’ll discover the real differentiator is our culture.
We push the limits of innovation to solve the world’s most important challenges—striving for execution excellence, while being direct, humble, collaborative, and inclusive of diverse perspectives. Join us as we shape the future of AI and beyond. Together, we advance your career.
We are seeking an experienced Technical Program Manager to drive end-to-end execution of AI cluster engineering programs spanning GPU platforms, rack-scale solutions, high-speed networking, and datacenter AI infrastructure. You will work cross-functionally to translate customer and internal requirements into executable plans, manage risks and dependencies, and deliver scalable, production-ready solutions across GPU → rack → cluster deployments.
THE PERSON
You are a hands‑on TPM who thrives in complex, fast-moving ecosystems and can connect deep technical details to crisp program plans, executive reporting, and customer outcomes. You will partner with cross-functional teams to help drive work from server integration through rack- and cluster-level validation. You bring strong ownership, structured execution, and the ability to lead through influence across engineering, operations, vendors, and customers.
KEY RESPONSIBILITIES
Program Leadership & Execution
- Define, plan, and drive program plans for AI infrastructure systems validation and readiness, including server integration, rack bring‑up, and cluster‑scale deployment readiness.
- Create and maintain core PM artifacts: schedules, dependency maps, resource forecasts, risk/issue logs, and program dashboards/status reports.
- Identify and drive mitigation plans for issues/risks, including cross-team escalations and corrective actions across multiple engineering areas.
- Own program execution for rack- and cluster-network enablement, including topology decisions, switching/optics/cabling readiness, and validation schedules for scale‑out operation.
- Drive alignment on advanced AI networking requirements, such as network architecture and reliability impacts that require mitigation.
- Partner with internal/external stakeholders to track and close network blockers.
- Lead cross‑functional delivery for rack solutions that integrate CPU + GPU + NICs, ensuring end-to-end readiness across hardware, firmware, and management interfaces.
- Drive requirements capture and execution planning for rack‑scale deployments (rack density, rack form factor, power targets, whips, liquid cooling, etc.) and ensure integration plans are validated with engineering and operations.
- Own program coordination for pod/rack manageability solutions, aligning requirements and milestones for inventory, health monitoring, cluster provisioning, and observability across large‑scale deployments.
- Coordinate with platform/automation teams on cluster provisioning and orchestration.
- Drive readiness for rack‑level automation and regression workflows (scripts, log mapping, infrastructure automation planning), planning execution to de‑risk hardware arrival timing.
- Partner with CI/CD and FW automation stakeholders to align on deliverables and validation gates.
- Proven program management experience delivering complex, cross‑functional hardware/software infrastructure programs (server/rack/cluster environments).
- Strong understanding of datacenter building blocks and lifecycle: servers, racks, clusters, HW/FW/SW integration, and readiness/validation flows.
- Demonstrated ability to build and run schedules, manage risks, lead matrix teams, and communicate clearly to engineering and executive…