HPC Systems Administrator Job, Buffalo area, New York, USA, IT/Tech

Empire AI is establishing New York as the national leader in responsible artificial intelligence. It is backed by a consortium of top academic and research institutions, including Columbia University, Cornell University, NYU, CUNY, RPI, SUNY, University of Rochester, RIT, Mount Sinai, and Flatiron Institute.

By leveraging the state's rich academic resources and research institutions, Empire AI is driving innovation in fields like medicine, education, energy, and climate change. It gives New York's researchers access to computing resources that are often prohibitively expensive and available only to big tech companies. That access fuels statewide innovation, drives economic growth, and prepares a future-ready AI workforce to tackle society's most complex challenges.

The initiative is funded by more than $500 million in public and private investment, including a State Capital Grant and contributions from academic institutions, the Simons Foundation, the Flatiron Institute, and Tom Secunda (co-founder of Bloomberg).

Position Summary

The HPC Systems Administrator will administer, optimize, and support the high-performance computing platforms that power Empire AI's AI/ML workloads, scientific research, and large-scale simulation across its statewide consortium. Reporting to the Manager, AI/ML Systems Administration, this role is responsible for day-to-day cluster operations, job scheduling, GPU resource management, and systems reliability of Empire AI's distributed HPC infrastructure.

This role ensures that Empire AI's shared computing environments remain available, performant, and accessible to researchers across partner institutions. The HPC Systems Administrator works at the intersection of systems administration, AI/ML infrastructure support, and research computing, bridging the gap between complex user workloads and the underlying HPC platform.

Duties and Responsibilities

Deploy, configure, and maintain Linux-based HPC clusters (Rocky/Ubuntu) at scale, including compute, GPU, storage, and management nodes
Administer and optimize the Slurm workload manager, including partition design, QOS policies, fair-share accounting, and cross-institutional workload orchestration models
Manage NVIDIA GPU resources (H100/H200/GB200) including driver, CUDA, firmware, and NCCL lifecycle management for AI training and inference workloads
Administer cluster management platforms such as NVIDIA Base Command Manager (BCM) for provisioning and system lifecycle management
Support containerized and virtualized research environments using Apptainer/Singularity, Pyxis, and Enroot
Troubleshoot performance bottlenecks, including MPI/NCCL collective traffic patterns and rail-optimized topologies for LLM and AI workloads
Administer parallel file systems such as Lustre and VAST, and integrate them with cluster storage workflows
Establish incident alerting and escalation procedures for HPC clusters and infrastructure
Manage detailed monitoring dashboards (Prometheus, Grafana) to track critical metrics: network throughput, GPU utilization, cluster health, and job telemetry

AI/ML Infrastructure Support

Architect and support systems for AI training and inference pipelines, including large language models (LLMs) and multimodal AI workloads
Tune and benchmark systems for GPU-intensive AI/ML frameworks including PyTorch and TensorFlow
Work with research faculty to translate scientific goals into technical configurations and workload requirements
Evaluate emerging HPC hardware and software solutions, and propose procurement recommendations aligned with AI/ML workload demands

Security & Compliance

Enforce security baselines, access control policies, and network segmentation across HPC environments
Integrate robust monitoring, alerting, access control, and disaster recovery planning into cluster operations
Partner with the Security & Compliance specialist to ensure security is integrated into system design and workload orchestration
Consult with research teams across consortium institutions to assess computational needs and advise on workflow optimization
Translate user feedback and researcher requirements into system-level improvements and configuration optimizations
Maintain clear system documentation, configuration…