Rack Scale Serviceability & Telemetry Architect Job, Austin area, Texas, USA, IT/Tech

WHAT YOU DO AT AMD CHANGES EVERYTHING

At AMD, our mission is to build great products that accelerate next‑generation computing experiences—from AI and data centers, to PCs, gaming and embedded systems. Grounded in a culture of innovation and collaboration, we believe real progress comes from bold ideas, human ingenuity and a shared passion to create something extraordinary. When you join AMD, you’ll discover that the real differentiator is our culture.

We push the limits of innovation to solve the world’s most important challenges—striving for execution excellence, while being direct, humble, collaborative, and inclusive of diverse perspectives. Join us as we shape the future of AI and beyond.

Together, we advance your career.

Rack Scale Serviceability & Telemetry Architect

THE TEAM

AMD’s Data Center GPU Systems Architecture team defines next‑generation AMD Instinct platforms and complete rack‑scale solutions for hyperscale AI and HPC deployments. We work across silicon, GPU system firmware, server and board architecture, BMC/platform firmware, management software, security, validation, manufacturing, and ecosystem partners to turn product strategy into deployable, serviceable, production‑ready platforms.

THE ROLE

AMD is seeking a Principal Member of Technical Staff (PMTS) to own the architecture for rack‑scale serviceability and telemetry across AMD Instinct product lines and complete rack‑scale solutions. This is a highly visible technical leadership role responsible for defining the end‑to‑end manageability, observability, and serviceability architecture spanning node, chassis/tray, rack, and fleet domains. You will drive the strategy, architecture, execution, and delivery of standards‑based solutions for inventory, discovery, health monitoring, telemetry, eventing, diagnostics, firmware lifecycle management, and field service workflows across the full AMD rack‑scale stack.

In this role, you will independently own a critical cross‑product architecture area and drive alignment across GPU/SoC architecture, server/platform architecture, BIOS/UEFI, BMC and embedded software, security, RAS, validation, ODM/OEM partners, and customer‑facing teams. The role spans early concept definition through bring‑up, validation, deployment, and post‑launch improvement.

THE PERSON

The ideal candidate is a deeply technical system architect with strong first‑principles thinking and a track record of delivering manageability, telemetry, and serviceability solutions for servers, accelerators, storage, networking, or rack‑scale AI/HPC platforms. You are equally comfortable setting long‑range technical direction and diving hands‑on into protocol definitions, interface design, telemetry models, bring‑up, debug, and root‑cause analysis. You thrive in ambiguity, influence without authority, raise execution quality across teams, and exemplify AMD’s values through direct, humble, collaborative, and inclusive leadership.

KEY RESPONSIBILITIES

Define and own the end‑to‑end rack‑scale serviceability and telemetry architecture for AMD Instinct‑based solutions, spanning node BMC, chassis/rack management, service processors/controllers, management network, and fleet‑level observability integration.
Define the standards strategy and interface architecture using DMTF Redfish, PLDM, MCTP, and related specifications, maximizing standards compliance while establishing AMD/OEM extensions only where required.
Drive OpenBMC‑based architecture and implementation direction for BMC and rack management controllers, including D‑Bus object models, bmcweb/Redfish requirements, sensor and FRU inventory models, logging, eventing, firmware update, and debug workflows.
Architect telemetry frameworks for health, power, thermal, inventory, error, utilization, and service data. Define schemas, metric taxonomies, triggers, event models, aggregation, retention, and reporting strategies required for at‑scale observability and automated service operations.
Define platform serviceability flows covering discovery, inventory correlation, fault isolation, diagnostics, crash‑dump and error capture, remote recovery, FRU replacement, firmware/driver update…