Director of GPU Fleet Operations
Listed on 2026-02-17
IT/Tech
Systems Engineer, Cloud Computing
About Gruve
Gruve is an innovative software services startup dedicated to transforming enterprises into AI powerhouses. We specialize in cybersecurity, customer experience, cloud infrastructure, and advanced technologies such as Large Language Models (LLMs). Our mission is to support our customers' business strategies by helping them use their data to make more intelligent decisions. As a well‑funded early‑stage startup, Gruve offers a dynamic environment with strong customer and partner networks.
About The Role
Gruve is a rapidly growing company enabling NEO Clouds to deliver GPU‑as‑a‑Service and AI infrastructure to AI‑native startups, enterprises, and research organizations. Our distributed fleet of GPU clusters spans colocation facilities, edge sites, and modular data centers globally, operating at the intersection of high‑performance computing, AI infrastructure, and cloud‑scale automation.
We are seeking a Director of GPU Fleet Operations to own the end‑to‑end lifecycle, reliability, and performance of our global GPU fleet, along with adjacent CPU and high‑performance storage clusters. This leader will drive strategy, execution, and scaling of hardware and infrastructure operations for thousands of GPUs across distributed environments, building remote operations teams, advancing automation, and leveraging AI to create a highly reliable, self‑healing GPU cloud platform.
Key Responsibilities
Fleet Strategy & Operations
- Own operational readiness, uptime, and performance of the global GPU fleet.
- Define and implement operational standards across OEM platforms (NVIDIA, Cisco, Dell, Supermicro, and others), GPU servers (NVIDIA, AMD, XPUs), and high‑speed networking (InfiniBand/RoCE).
- Standardize operations across liquid‑ and air‑cooled environments, colocation sites, and modular data centers.
- Establish global processes for provisioning, monitoring, maintenance, incident response, and lifecycle management.
- Build and manage the full hardware lifecycle from deployment through retirement, leveraging outsourced resources for remote site operations.
- Develop scalable processes for diagnostics, RMA coordination, spare‑parts forecasting, and reliability engineering.
- Define and track fleet SLOs/SLAs including availability, MTTR, MTBF, and utilization.
- Build and lead a 24×7 global remote operations organization.
- Develop a remote‑first model to manage distributed clusters.
- Implement standardized runbooks, escalation paths, and observability across hardware, performance, power, cooling, and environmental telemetry.
- Partner with Platform/DevOps teams to maintain cluster software stacks (Kubernetes, Slurm, Kubeflow).
- Oversee GPU drivers, firmware, CUDA stack, and configuration automation.
- Own patching, upgrades, change management, and low‑impact maintenance practices.
- Manage platform layers operating above Kubernetes, including agent infrastructure.
- Lead adoption of AI/ML for predictive failure detection, anomaly detection, alert triage, and automated remediation.
- Build toward an autonomous, self‑healing GPU fleet through data‑driven automation.
- Manage OEM and repair vendor relationships and enforce SLAs.
- Coordinate global field technicians and remote hands support.
- Partner with Customer Success and Capacity Planning teams to ensure GPU availability and performance.
- Support large‑scale deployments, escalations, and on‑premise customer installations.
- Hire and lead teams across hardware operations, reliability engineering, NOC, and automation engineering.
- Establish KPIs, dashboards, and operational reporting to support rapid growth.
- 10+ years of experience in infrastructure, data center, or cloud operations.
- 5+ years managing distributed hardware fleets or large‑scale compute environments.
- Experience operating GPU, HPC, or high‑performance compute clusters.
- Proven experience leading 24×7 operations teams.
- Strong technical understanding of:
  - GPU servers and accelerator infrastructure
  - High‑speed networking (Infini…