
Sr. System Development Engineer, AI/ML/Storage server team

Job in Seattle, King County, Washington, 98127, USA
Listing for: Amazon
Full Time position
Listed on 2026-05-26
Job specializations:
  • IT/Tech
    Systems Engineer, Hardware Engineer
Salary/Wage Range or Industry Benchmark: 60000 - 80000 USD Yearly
Job Description & How to Apply Below

Description

We are seeking an experienced Senior Systems Development Engineer to lead the development of automation software, diagnostic tooling, and fleet health infrastructure for our server platforms. You will work across multiple teams and organizations to build scalable, reliable systems that keep our storage and accelerated (AI/ML) compute fleet healthy — with a vision toward zero-touch operations where automation detects, diagnoses, and resolves issues without human intervention.

You will be a technical leader solving complex architectural problems that may not be well-defined in advance. You will own your team's systems, proactively identify deficiencies, and write scalable, robust code to solve issues before they impact customers. You will decompose large, difficult server testability, reliability, and diagnosis problems into straightforward tasks and components, leading delivery yourself and through others in parallel, using a combination of hardware, software, system design, processor architecture, diagnostics, and operations knowledge.

You will collaborate with a variety of roles (SDEs, SDETs, Mechanical/Electrical/Hardware Engineers, TPMs, Managers, Principals) and organizations through server conception, test validation, qualification, launch, and operations — driving high quality and reliability into current and future designs for AWS server solutions. You will also work closely with ODMs and Design Partners to ensure our tooling, diagnostics, and automation requirements are met throughout the hardware development lifecycle (NPI).

Key job responsibilities

Fleet Health & Predictive Infrastructure
  • Build and own the automation infrastructure responsible for the health of the server fleet across storage and accelerator (AI/ML) compute platforms
  • Design and implement predictive failure detection systems using telemetry, sensor data, error trending, and log correlation to identify hardware issues before they cause customer impact
  • Drive toward zero-touch operations — building automation that detects, diagnoses, triages, and remediates hardware and software faults without human intervention
  • Develop monitoring tools, dashboards, and alerting systems to provide real-time visibility into fleet health across lab and production environments
  • Define and track fleet health metrics (failure rates, mean time to detect, mean time to repair, first-time fix rate, predictive accuracy)
Debugging & Troubleshooting
  • Debug and resolve complex system-level issues across storage, compute, GPU, and networking in production environments
  • Troubleshoot Linux boot and runtime failures across x86 and ARM architectures, including PCIe, power, NIC, NVMe, and GPU subsystems
  • Perform root cause analysis on hardware failures — correlating across firmware, kernel, driver, and physical layer to isolate faults
  • Build diagnostic tooling that automates root cause identification and reduces reliance on manual triage
  • Improve manufacturing throughput and yield through test optimization
Systems Development & Automation
  • Lead the definition and development of software, automation, and enabling tools for server hardware programs; track and report progress
  • Design and build scalable system-level software with focus on durability, availability, security, and diagnostics
  • Develop and maintain device drivers for Linux on ARM and x86 architectures
  • Build automation solutions using modern programming languages (Python, Ruby, Java, C/C++, etc.)
  • Work with OS internals, storage subsystems, and accelerator/GPU software stacks in Linux-based environments
  • Build, manage, and deploy CI/CD pipelines for rapid deployment of code changes to org-owned and customer-owned systems
Cross-Team Collaboration
  • Work across internal HWEng teams to ensure new server hardware addresses data path and control path functionality needed by dependent service teams
  • Work closely with internal customers to identify potential problems early when onboarding new servers (storage or accelerated compute) into their ecosystem
  • Engage with ODMs and design partners on testability, diagnostic, and automation requirements during hardware design and development
  • Contribute to server design to improve…