HPC Systems Administrator, Networking & Data Center Operations
Listed on 2026-05-16
IT/Tech
Systems Engineer, Data Engineer, Cloud Computing
Overview
Empire AI is establishing New York as the national leader in responsible artificial intelligence. It is backed by a consortium of top academic and research institutions including Columbia University, Cornell University, NYU, CUNY, RPI, SUNY, University of Rochester, RIT, Mount Sinai, and Flatiron Institute.
By leveraging the state's rich academic resources and research institutions, Empire AI is driving innovation in fields like medicine, education, energy, and climate change. It gives New York's researchers access to computing resources that are often prohibitively expensive and available only to big tech companies, fueling statewide innovation, driving economic growth, and preparing a future-ready AI workforce to tackle society's most complex challenges.
The initiative is funded by more than $500 million in public and private investment, including a State Capital Grant and contributions from academic institutions, the Simons Foundation, the Flatiron Institute, and Tom Secunda (co-founder of Bloomberg).
Position Summary
The HPC Systems Administrator, Networking & Data Center Operations will design, deploy, and maintain the high-speed network fabrics and physical data center infrastructure that underpin Empire AI's shared and distributed high-performance computing environments. Reporting to the Manager, AI/ML Systems Administration, this role is responsible for ensuring the reliability, performance, and scalability of the network and data center systems that support AI/ML workloads, large-scale simulations, and research computing across Empire AI's statewide consortium of academic and research institutions.
This role serves as the operational backbone of Empire AI's HPC infrastructure, translating architectural designs into hardened, high-availability environments while collaborating closely with systems, security, and research teams to meet the demands of cutting-edge AI workloads across a federated, multi-institutional platform.
Duties and Responsibilities
High-Performance Networking
- Design, deploy, and maintain InfiniBand (HDR/NDR) and RoCEv2/Ethernet fabrics for low-latency, high-throughput HPC and AI workloads across Empire AI's distributed cluster environment
- Implement and manage leaf-spine network architectures, EVPN-VXLAN overlays, and RDMA-optimized configurations across federated, multi-institutional environments
- Troubleshoot network-layer performance bottlenecks, including MPI/NCCL collective traffic patterns and rail-optimized topologies for LLM and multimodal AI workloads
- Perform cable plant management, optical transceiver diagnostics, and switch firmware upgrades across the data center fabric
- Evaluate and recommend emerging network hardware, interconnects, and architectures to meet Empire AI's evolving AI infrastructure needs
- Plan and execute hardware deployments including racking, stacking, and cabling of compute nodes, GPU servers, switches, and storage arrays
- Maintain DCIM (Data Center Infrastructure Management) records for accurate asset inventory, power mapping, and capacity planning across Empire AI's infrastructure
- Coordinate with facilities teams on power, cooling, and physical security for compute environments
- Manage hardware lifecycle including procurement support, RMA processing, firmware/BIOS standardization, and decommissioning
- Conduct routine health checks and physical inspections; respond to hardware alerts and data center incidents
- Develop and enforce data center standards for cable management, labeling, and physical documentation
- Deploy, configure, and maintain Linux-based HPC clusters (Rocky/Ubuntu) at scale, including compute, GPU, storage, and management nodes
- Administer cluster management platforms such as NVIDIA Base Command Manager (BCM) for provisioning and system lifecycle management
- Ensure workload portability and compatibility across heterogeneous hardware and storage platforms
- Build and maintain comprehensive monitoring dashboards using Prometheus and Grafana to track cluster health, GPU utilization, network throughput, and job telemetry
- Develop automation for provisioning, health checks, firmware updates, and configuration management
- Implement…