HPC Data Center Operational Lead

Job in New York, New York County, New York, 10261, USA
Listing for: P2P
Full Time position
Listed on 2026-05-23
Job specializations:
  • IT/Tech
    Systems Engineer, Hardware Engineer, IT Support
Salary/Wage Range or Industry Benchmark: 120,000 - 160,000 USD per year
Job Description & How to Apply Below
Location: New York

HPC Infrastructure Operations Lead

Location: Chicago or New York (On‑site 5 days/week; regular travel to HPC data center sites required)

Jump's HPC infrastructure powers some of the most demanding computational workloads in the industry. As our HPC footprint grows, we need a seasoned operations leader to own the reliability, standards, and day‑to‑day excellence of these environments.

What You'll Do:
  • Team Leadership & Organizational Ownership
    • Lead and manage data center site leads and their teams across multiple HPC facilities; site leads report directly to this role.
    • Recruit, mentor, and develop team members while conducting performance reviews and building a culture of operational rigor.
    • Direct onsite contractors by providing clear scope and validating completed work.
  • HPC Data Center Standards, Processes & Preventative Maintenance
    • Develop, document, and enforce operational standards and procedures for Jump's HPC data centers covering power, cooling, cabling, and hardware lifecycle.
    • Design and own the preventative maintenance program, including scheduled inspections, component replacements, and firmware/capacity reviews to minimize unplanned downtime.
    • Drive continuous improvement of operational processes and pursue automation—including AI‑driven approaches—to reduce manual effort and human error.
  • Critical Facility Systems Expertise
    • Serve as the subject matter authority on HPC data center power distribution, power striping strategies, and failover/redundancy configurations.
    • Own expertise across air cooling, liquid cooling (direct‑to‑chip, rear‑door, CDU‑based), and hybrid cooling architectures.
    • Maintain deep knowledge of environmental monitoring and controls (temperature, humidity, airflow, leak detection) and ensure systems remain within design parameters.
  • Monitoring & Incident Response
    • Own the HPC data center monitoring strategy end‑to‑end: define what is monitored, set alerting thresholds, and ensure comprehensive visibility into facility and hardware health.
    • Leverage AI tools to analyze telemetry data, identify failure patterns, predict potential issues, and accelerate root cause analysis during incidents.
    • Lead critical incident response and drive root cause analysis and corrective actions to prevent recurrence.
    • Establish and track operational KPIs including availability, mean time to repair, and efficiency metrics.
  • Server & Switch Hardware Expertise
    • Maintain deep, hands‑on knowledge of server hardware architectures including multi‑socket platforms, GPU/accelerator configurations, memory subsystems, NVMe/storage controllers, BMC/IPMI management, and firmware lifecycle.
    • Maintain deep, hands‑on knowledge of network switch hardware including line cards, optics/transceivers, switch fabrics, and platform‑specific diagnostics for Arista and Cisco platforms.
    • Evaluate new hardware platforms, drive hardware qualification and acceptance testing, and provide informed recommendations on hardware selection.
  • Hardware Break‑Fix
    • Own the overall hardware break‑fix function across all HPC sites, ensuring rapid diagnosis and resolution for servers, GPUs, network equipment, storage, and facility infrastructure.
    • Diagnose complex hardware failures at the component level—CPUs, DIMMs, GPUs, NICs, PSUs, fans, drives, switch line cards, and optics—and direct the team to resolve efficiently.
    • Establish escalation paths, SLA targets, and reporting for hardware failures.
  • Inventory & Spares Management
    • Own inventory processes and spares tracking across all HPC facilities, ensuring critical spares are stocked, tracked, and replenished to meet availability targets.
    • Maintain accurate asset records for all serialized and consumable inventory.
  • Planning, Vendor & Budget Management
    • Conduct capacity planning for space, power, cooling, and cabling to stay ahead of growth.
    • Gather requirements and plan new hardware installations including physical placement, power/cooling needs, and cabling.
    • Manage relationships with colocation providers and hardware vendors; negotiate contracts and SLAs.
    • Develop and manage operational budgets for equipment, staffing, and facilities.
  • Networking & Linux
    • Possess strong working knowledge of networking concepts including L2/L3…