Cloud Hardware Development Engineer, Cloud AI/ML/storage server teams; Cupertino area, California, USA; Engineering

Cloud Hardware Development Engineer, Cloud AI/ML/storage server teams

Amazon Data Services, Inc.

As a Cloud Hardware Development Engineer, you will be an end-to-end owner of storage and/or accelerator (AI/ML/GPU) server platforms — from New Product Introduction (NPI) through fleet health in production.

You will work closely with internal customers to understand technical needs and business goals, leveraging your experience to architect solutions at scale.

In this role you will collaborate with component, firmware, power, mechanical, electrical, test, qualification, and manufacturing engineers, and lead our ODM partners to bring servers to the data center. After launch, you will monitor quality, drive reliability improvements, and ensure ongoing operational excellence.

Key Responsibilities

NPI – New Product Introduction
- Own the end‑to‑end NPI lifecycle for storage and/or accelerator server platforms—from architecture definition through design, qualification, manufacturing ramp, and launch.
- Lead technical solutions for complex server and rack system architectural challenges.
- Work with ODM / manufacturing partners to develop, validate, and manufacture server products at scale.
- Develop functional specifications, design verification plans, and test procedures.
- Drive qualification and readiness milestones, ensuring new platforms meet performance, reliability, and cost targets before fleet deployment.
- Identify and resolve technical risks early in the development cycle to prevent problems from reaching production.

Fleet Health, Diagnostics & Automation
- Own fleet health for launched server platforms; responsibility continues after the product ships.
- Design and implement predictive failure detection systems using telemetry, sensor data, error trending, and log correlation.
- Drive toward zero‑touch operations: build detection, diagnosis, and remediation of faults without human intervention.
- Debug complex system failures in time‑sensitive settings and perform root‑cause analysis across firmware, kernel, driver, thermal, power, and physical layers.

Systems Design & Technical Depth
- Apply expertise across hardware, software, system design, x86 architecture, and operations.
- Design and implement solutions to address system‑level issues at large scale.
- Collaborate with hardware, software, manufacturing, supply chain, and product teams.

Cross‑Team Collaboration
- Work closely with internal customers to ensure new hardware meets data path and control path requirements.
- Identify potential problems early when onboarding servers into customer ecosystems.
- Partner with datacenter operations to close the loop between field failures and design improvements.

A Day in the Life
Your day-to-day work includes interfacing with internal and external customers, reviewing platform designs with ODMs, analyzing logs in depth, and chasing down fleet failures. The role spans a range of responsibilities that will continually challenge you.

Basic Qualifications

Experience in developing functional specifications, design verification plans, and functional test procedures.
Bachelor's degree or higher in electrical engineering, computer engineering, or an equivalent field.
Proficient English‑language communication skills, both written and verbal.
Experience in design, innovation, and research & development.
Knowledge of operating systems, hardware, storage, network, security, database administration, and cloud infrastructure.
Experience with server technologies such as thermal, mechanical, power, and signal integrity.
5+ years of professional (non‑internship) work experience.

Preferred Qualifications

5+ years of experience in hardware design and validation of components, subsystems, and systems.
Experience with server technologies: board design, high‑speed bus design, signal integrity, failure analysis, CPU, GPU, SSD, memory, BIOS, BMC, and networking.
Experience developing and executing test procedures for mechanical or electrical systems.
Experience working with ODMs throughout the product development and manufacturing lifecycle.
Experience building predictive failure detection or proactive remediation systems at fleet scale.
Experience with storage/compute/GPU/accelerator platforms—including integration, diagnostics,…
