Data Center Technical Manager
Listed on 2025-12-01
Engineering
Electrical Engineering, Systems Engineer, Engineering Design & Technologists
About the Role:
As the Technical Manager for the Argonne National Laboratory (ANL) Aurora Exascale Supercomputer Facility, you will serve as the primary engineering authority supporting one of the most advanced high‑performance computing (HPC) environments in the world. This role provides specialized oversight of critical electrical, mechanical, cooling, and facility infrastructure directly supporting Aurora’s exascale compute capability.
You will act as both:
- Site Subject Matter Expert (SME) for Aurora’s high‑density compute, liquid‑cooling, and MEP systems; and
- Owner of Critical Environment Risk Management (CERM) for the ANL Aurora facility, ensuring the stability, safety, and resilience of the supporting critical infrastructure.
This position requires an advanced engineering skill set, deep infrastructure knowledge, and the ability to operate within a national laboratory environment, supporting DOE mission‑critical workloads and rigorous compliance requirements.
Essential Duties & Responsibilities:
- Ensure all Aurora infrastructure operations comply with DOE, ANL, CBRE, and vendor technical requirements, including contract‑specific engineering deliverables.
- Maintain and govern all critical infrastructure drawings for Aurora, including chilled water distribution, liquid‑cooling loops, electrical one‑lines, and HPC load distribution diagrams.
- Develop and maintain detailed site‑specific RACI matrices tailored to Aurora’s HPC operational structure.
- Perform staffing analysis to support 24/7 coverage, high‑density cooling management, and HPC system maintenance windows.
- Lead creation and updates of business continuity and disaster recovery plans specific to the high‑performance computing environment.
- Implement training and qualification programs for critical facilities engineers supporting Aurora’s unique cooling and power systems.
- Coordinate mock failure scenarios (e.g., chiller plant failures, power transitions, liquid‑cooling loop disruptions) to validate operational readiness.
- Oversee maintenance scheduling to align with Aurora compute job scheduling, minimizing scientific workload interruptions.
- Develop and refine operating procedures (SOP/EOP/MOP) specific to exascale HPC cooling, power sequencing, and system change protocols.
- Provide incident response support, including contribution to root cause analysis for any facility events that impact HPC availability.
- Manage Aurora’s power, cooling, and infrastructure capacity in line with exascale system demands and future technology refresh cycles.
- Maintain a site‑specific risk register covering HPC thermal loads, water systems, UPS distribution, and mission interruptions.
- Oversee airflow and liquid‑cooling optimization to support Aurora’s extreme compute power density.
- Assess the lifecycle condition of assets supporting Aurora (UPS, CW pumps, cooling towers, CDUs, heat exchangers) and inform ANL capital planning.
- Develop and implement sustainability strategies in accordance with DOE sustainability goals, including water usage, energy efficiency, and PUE/WUE reduction.
Operation, maintenance, and repair of data center critical infrastructure, including:
- Standby generators, UPS systems, PDUs, and ATSs supporting the Aurora supercomputer
- Large‑scale chilled water plants, liquid‑cooling distribution units, and plate heat exchangers
- CRAHs, CRACs, and Aurora‑specific cooling technologies (e.g., warm‑water loops, rear‑door exchangers)
- BMS, EPMS, CMMS, DCIM systems integrated with HPC telemetry systems
Engineering Knowledge of:
- Psychrometric charts, HVAC load calculations, and hydronic pipe sizing.
- Reading electrical one‑lines and chilled and condenser water diagrams.
- Standard sequences of operation for electrical and mechanical data center systems.
- Electrical power calculations per NFPA 70 (NEC), coordination, arc‑flash studies (NFPA 70E), and maintenance practices (NFPA 70B).
- Industry standards, including ASHRAE Datacom/TC 9.9 and OCP publications.
- Principles of preventative, predictive, and reactive maintenance.
- Energy efficiency metrics (e.g., PUE, WUE) and sustainable data center practices.
- Ability to analyze performance data from Aurora’s environmental telemetry, SCADA…