HPC Data Center Operational Lead
Job in New York, New York County, New York, 10261, USA
Listed on 2026-05-23
Listing for: P2P
Full Time position
Job specializations:
- IT/Tech: Systems Engineer, Hardware Engineer, IT Support
HPC Infrastructure Operations Lead
Location: Chicago or New York (On‑site 5 days/week; regular travel to HPC data center sites required)
Jump's HPC infrastructure powers some of the most demanding computational workloads in the industry. As our HPC footprint grows, we need a seasoned operations leader to own the reliability, standards, and day‑to‑day excellence of these environments.
What You'll Do:
- Team Leadership & Organizational Ownership
- Lead and manage data center site leads and their teams across multiple HPC facilities; site leads report directly to this role.
- Recruit, mentor, and develop team members while conducting performance reviews and building a culture of operational rigor.
- Direct onsite contractors by providing clear scope and validating completed work.
- HPC Data Center Standards, Processes & Preventative Maintenance
- Develop, document, and enforce operational standards and procedures for Jump's HPC data centers covering power, cooling, cabling, and hardware lifecycle.
- Design and own the preventative maintenance program, including scheduled inspections, component replacements, and firmware/capacity reviews to minimize unplanned downtime.
- Drive continuous improvement of operational processes and pursue automation—including AI‑driven approaches—to reduce manual effort and human error.
- Critical Facility Systems Expertise
- Serve as the subject matter authority on HPC data center power distribution, power striping strategies, and failover/redundancy configurations.
- Own expertise across air cooling, liquid cooling (direct‑to‑chip, rear‑door, CDU‑based), and hybrid cooling architectures.
- Maintain deep knowledge of environmental monitoring and controls (temperature, humidity, airflow, leak detection) and ensure systems remain within design parameters.
- Monitoring & Incident Response
- Own the HPC data center monitoring strategy end‑to‑end: define what is monitored, set alerting thresholds, and ensure comprehensive visibility into facility and hardware health.
- Leverage AI tools to analyze telemetry data, identify failure patterns, predict potential issues, and accelerate root cause analysis during incidents.
- Lead critical incident response and drive root cause analysis and corrective actions to prevent recurrence.
- Establish and track operational KPIs including availability, mean time to repair, and efficiency metrics.
- Server & Switch Hardware Expertise
- Maintain deep, hands‑on knowledge of server hardware architectures including multi‑socket platforms, GPU/accelerator configurations, memory subsystems, NVMe/storage controllers, BMC/IPMI management, and firmware lifecycle.
- Maintain deep, hands‑on knowledge of network switch hardware including line cards, optics/transceivers, switch fabrics, and platform‑specific diagnostics for Arista and Cisco platforms.
- Evaluate new hardware platforms, drive hardware qualification and acceptance testing, and provide informed recommendations on hardware selection.
- Hardware Break‑Fix
- Own the overall hardware break‑fix function across all HPC sites, ensuring rapid diagnosis and resolution for servers, GPUs, network equipment, storage, and facility infrastructure.
- Diagnose complex hardware failures at the component level—CPUs, DIMMs, GPUs, NICs, PSUs, fans, drives, switch line cards, and optics—and direct the team to resolve efficiently.
- Establish escalation paths, SLA targets, and reporting for hardware failures.
- Inventory & Spares Management
- Own inventory processes and spares tracking across all HPC facilities, ensuring critical spares are stocked, tracked, and replenished to meet availability targets.
- Maintain accurate asset records for all serialized and consumable inventory.
- Planning, Vendor & Budget Management
- Conduct capacity planning for space, power, cooling, and cabling to stay ahead of growth.
- Gather requirements and plan new hardware installations including physical placement, power/cooling needs, and cabling.
- Manage relationships with colocation providers and hardware vendors; negotiate contracts and SLAs.
- Develop and manage operational budgets for equipment, staffing, and facilities.
- Networking & Linux
- Possess strong working knowledge of networking concepts including L2/L3…