Director of GPU Fleet Operations

Job in Redwood City, San Mateo County, California, 94061, USA
Listing for: Gruve
Full Time position
Listed on 2026-02-17
Job specializations:
  • IT/Tech
    Systems Engineer, Cloud Computing
Salary/Wage Range or Industry Benchmark: 100,000 - 125,000 USD yearly
Job Description & How to Apply Below

About Gruve

Gruve is an innovative software services startup dedicated to transforming enterprises into AI powerhouses. We specialize in cybersecurity, customer experience, cloud infrastructure, and advanced technologies such as Large Language Models (LLMs). Our mission is to help our customers use their data to make more intelligent business decisions. As a well‑funded early‑stage startup, Gruve offers a dynamic environment with strong customer and partner networks.

About the Role

Gruve is a rapidly growing company enabling NEO Clouds to deliver GPU‑as‑a‑Service and AI infrastructure to AI‑native startups, enterprises, and research organizations. Our distributed fleet of GPU clusters spans colocation facilities, edge sites, and modular data centers globally, operating at the intersection of high‑performance computing, AI infrastructure, and cloud‑scale automation.

We are seeking a Director of GPU Fleet Operations to own the end‑to‑end lifecycle, reliability, and performance of our global GPU fleet, along with adjacent CPU and high‑performance storage clusters. This leader will drive strategy, execution, and scaling of hardware and infrastructure operations for thousands of GPUs across distributed environments, building remote operations teams, advancing automation, and leveraging AI to create a highly reliable, self‑healing GPU cloud platform.

Key Responsibilities

Fleet Strategy & Operations
  • Own operational readiness, uptime, and performance of the global GPU fleet.
  • Define and implement operational standards across OEM platforms (NVIDIA, Cisco, Dell, Supermicro, and others), GPU servers (NVIDIA, AMD, XPUs), and high‑speed networking (InfiniBand/RoCE).
  • Standardize operations across liquid‑ and air‑cooled environments, colocation sites, and modular data centers.
  • Establish global processes for provisioning, monitoring, maintenance, incident response, and lifecycle management.
Hardware Lifecycle & Reliability
  • Build and manage the full hardware lifecycle from deployment through retirement, leveraging outsourced resources for remote site operations.
  • Develop scalable processes for diagnostics, RMA coordination, spare‑parts forecasting, and reliability engineering.
  • Define and track fleet SLOs/SLAs including availability, MTTR, MTBF, and utilization.
Remote Operations & NOC Leadership
  • Build and lead a 24×7 global remote operations organization.
  • Develop a remote‑first model to manage distributed clusters.
  • Implement standardized runbooks, escalation paths, and observability across hardware, performance, power, cooling, and environmental telemetry.
Software & Platform Maintenance
  • Partner with Platform/DevOps teams to maintain cluster software stacks (Kubernetes, Slurm, Kubeflow).
  • Oversee GPU drivers, firmware, CUDA stack, and configuration automation.
  • Own patching, upgrades, change management, and low‑impact maintenance practices.
  • Manage platform layers operating above Kubernetes, including agent infrastructure.
AI‑Driven Operations (AIOps)
  • Lead adoption of AI/ML for predictive failure detection, anomaly detection, alert triage, and automated remediation.
  • Build toward an autonomous, self‑healing GPU fleet through data‑driven automation.
Vendor & Field Coordination
  • Manage OEM and repair vendor relationships and enforce SLAs.
  • Coordinate global field technicians and remote hands support.
Capacity & Customer Operations
  • Partner with Customer Success and Capacity Planning teams to ensure GPU availability and performance.
  • Support large‑scale deployments, escalations, and on‑premise customer installations.
Team Leadership & Scaling
  • Hire and lead teams across hardware operations, reliability engineering, NOC, and automation engineering.
  • Establish KPIs, dashboards, and operational reporting to support rapid growth.
Basic Qualifications
  • 10+ years of experience in infrastructure, data center, or cloud operations.
  • 5+ years managing distributed hardware fleets or large‑scale compute environments.
  • Experience operating GPU, HPC, or high‑performance compute clusters.
  • Proven experience leading 24×7 operations teams.
  • Strong technical understanding of:
    • GPU servers and accelerator infrastructure
    • High‑speed networking (Infini…