HPC Systems Administrator, Networking & Data Center Operations

Job in Buffalo, Erie County, New York, 14266, USA
Listing for: Empire AI
Full Time position
Listed on 2026-05-16
Job specializations:
  • IT/Tech
    Systems Engineer, Data Engineer, Cloud Computing
Salary/Wage Range or Industry Benchmark: USD 60,000 - 80,000 per year
Job Description & How to Apply Below

Overview

Empire AI is establishing New York as the national leader in responsible artificial intelligence. It is backed by a consortium of top academic and research institutions, including Columbia University, Cornell University, NYU, CUNY, RPI, SUNY, University of Rochester, RIT, Mount Sinai, and Flatiron Institute.

By leveraging the state's rich academic resources and research institutions, Empire AI is driving innovation in fields like medicine, education, energy, and climate change. It gives New York's researchers access to computing resources that are often prohibitively expensive and available only to big tech companies, fueling statewide innovation, driving economic growth, and preparing a future-ready AI workforce to tackle society's most complex challenges.

The initiative is funded by more than $500 million in public and private investments, with funders including a State Capital Grant, academic institutions, the Simons Foundation, the Flatiron Institute, and Tom Secunda (Co-Founder of Bloomberg).

Position Summary

The HPC Systems Administrator, Networking & Data Center Operations will design, deploy, and maintain the high-speed network fabrics and physical data center infrastructure that underpin Empire AI's shared and distributed high-performance computing environments. Reporting to the Manager, AI/ML Systems Administration, this role is responsible for ensuring the reliability, performance, and scalability of the network and data center systems that support AI/ML workloads, large-scale simulations, and research computing across Empire AI's statewide consortium of academic and research institutions.

This role serves as the operational backbone of Empire AI's HPC infrastructure, translating architectural designs into hardened, high-availability environments while collaborating closely with systems, security, and research teams to meet the demands of cutting-edge AI workloads across a federated, multi-institutional platform.

Duties and Responsibilities

High-Performance Networking
  • Design, deploy, and maintain InfiniBand (HDR/NDR) and RoCEv2/Ethernet fabrics for low-latency, high-throughput HPC and AI workloads across Empire AI's distributed cluster environment
  • Implement and manage leaf-spine network architectures, EVPN-VXLAN overlays, and RDMA-optimized configurations across federated, multi-institutional environments
  • Troubleshoot network-layer performance bottlenecks, including MPI/NCCL collective traffic patterns and rail-optimized topologies for LLM and multimodal AI workloads
  • Perform cable plant management, optical transceiver diagnostics, and switch firmware upgrades across the data center fabric
  • Evaluate and recommend emerging network hardware, interconnects, and architectures to meet Empire AI's evolving AI infrastructure needs
Data Center Operations
  • Plan and execute hardware deployments including racking, stacking, and cabling of compute nodes, GPU servers, switches, and storage arrays
  • Maintain DCIM (Data Center Infrastructure Management) records for accurate asset inventory, power mapping, and capacity planning across Empire AI's infrastructure
  • Coordinate with facilities teams on power, cooling, and physical security for compute environments
  • Manage hardware lifecycle including procurement support, RMA processing, firmware/BIOS standardization, and decommissioning
  • Conduct routine health checks and physical inspections; respond to hardware alerts and data center incidents
  • Develop and enforce data center standards for cable management, labeling, and physical documentation
  • Deploy, configure, and maintain Linux-based HPC clusters (Rocky/Ubuntu) at scale, including compute, GPU, storage, and management nodes
  • Administer cluster management platforms such as NVIDIA Base Command Manager (BCM) for provisioning and system lifecycle management
  • Ensure workload portability and compatibility across heterogeneous hardware and storage platforms
Monitoring, Automation & Observability
  • Build and maintain comprehensive monitoring dashboards using Prometheus and Grafana to track cluster health, GPU utilization, network throughput, and job telemetry
  • Develop automation for provisioning, health checks, firmware updates, and configuration management
  • Implement…