HPC Systems Administrator, Networking & Data Center Operations Job Buffalo area, New York USA, IT/Tech

Overview

Empire AI is establishing New York as the national leader in responsible artificial intelligence. It is backed by a consortium of top academic and research institutions including Columbia University, Cornell University, NYU, CUNY, RPI, SUNY, University of Rochester, RIT, Mount Sinai, and Flatiron Institute.

By leveraging the state's rich academic resources and research institutions, Empire AI is driving innovation in fields like medicine, education, energy, and climate change, while giving New York's researchers access to computing resources that are often prohibitively expensive and available only to big tech companies. This access fuels statewide innovation, drives economic growth, and prepares a future-ready AI workforce to tackle society's most complex challenges.

The initiative is funded by more than $500 million in public and private investments, including a State Capital Grant and contributions from academic institutions, the Simons Foundation, the Flatiron Institute, and Tom Secunda (Co-Founder of Bloomberg).

Position Summary

The HPC Systems Administrator, Networking & Data Center Operations will design, deploy, and maintain the high-speed network fabrics and physical data center infrastructure that underpin Empire AI's shared and distributed high-performance computing environments. Reporting to the Manager, AI/ML Systems Administration, this role is responsible for ensuring the reliability, performance, and scalability of the network and data center systems that support AI/ML workloads, large-scale simulations, and research computing across Empire AI's statewide consortium of academic and research institutions.

This role serves as the operational backbone of Empire AI's HPC infrastructure, translating architectural designs into hardened, high-availability environments while collaborating closely with systems, security, and research teams to meet the demands of cutting-edge AI workloads across a federated, multi-institutional platform.

Duties and Responsibilities

High-Performance Networking

Design, deploy, and maintain InfiniBand (HDR/NDR) and RoCEv2/Ethernet fabrics for low-latency, high-throughput HPC and AI workloads across Empire AI's distributed cluster environment
Implement and manage leaf-spine network architectures, EVPN-VXLAN overlays, and RDMA-optimized configurations across federated, multi-institutional environments
Troubleshoot network-layer performance bottlenecks, including MPI/NCCL collective traffic patterns and rail-optimized topologies for LLM and multimodal AI workloads
Perform cable plant management, optical transceiver diagnostics, and switch firmware upgrades across the data center fabric
Evaluate and recommend emerging network hardware, interconnects, and architectures to meet Empire AI's evolving AI infrastructure needs

Data Center Operations

Plan and execute hardware deployments including racking, stacking, and cabling of compute nodes, GPU servers, switches, and storage arrays
Maintain DCIM (Data Center Infrastructure Management) records for accurate asset inventory, power mapping, and capacity planning across Empire AI's infrastructure
Coordinate with facilities teams on power, cooling, and physical security for compute environments
Manage hardware lifecycle including procurement support, RMA processing, firmware/BIOS standardization, and decommissioning
Conduct routine health checks and physical inspections; respond to hardware alerts and data center incidents
Develop and enforce data center standards for cable management, labeling, and physical documentation
Deploy, configure, and maintain Linux-based HPC clusters (Rocky/Ubuntu) at scale, including compute, GPU, storage, and management nodes
Administer cluster management platforms such as NVIDIA Base Command Manager (BCM) for provisioning and system lifecycle management
Ensure workload portability and compatibility across heterogeneous hardware and storage platforms

Monitoring, Automation & Observability

Build and maintain comprehensive monitoring dashboards using Prometheus and Grafana to track cluster health, GPU utilization, network throughput, and job telemetry
Develop automation for provisioning, health checks, firmware updates, and configuration management
Implement…