Director of GPU Fleet Operations, Redwood City area, California, USA (IT/Tech)

About Gruve

Gruve is an innovative software services startup dedicated to transforming enterprises into AI powerhouses. We specialize in cybersecurity, customer experience, cloud infrastructure, and advanced technologies such as Large Language Models (LLMs). Our mission is to help our customers use their data to make more intelligent business decisions. As a well-funded early‑stage startup, Gruve offers a dynamic environment with strong customer and partner networks.

About the Role

Gruve is a rapidly growing company enabling NEO Clouds to deliver GPU‑as‑a‑Service and AI infrastructure to AI‑native startups, enterprises, and research organizations. Our distributed fleet of GPU clusters spans colocation facilities, edge sites, and modular data centers globally, operating at the intersection of high‑performance computing, AI infrastructure, and cloud‑scale automation.

We are seeking a Director of GPU Fleet Operations to own the end‑to‑end lifecycle, reliability, and performance of our global GPU fleet, along with adjacent CPU and high‑performance storage clusters. This leader will drive strategy, execution, and scaling of hardware and infrastructure operations for thousands of GPUs across distributed environments, building remote operations teams, advancing automation, and leveraging AI to create a highly reliable, self‑healing GPU cloud platform.

Key Responsibilities

Fleet Strategy & Operations

Own operational readiness, uptime, and performance of the global GPU fleet.
Define and implement operational standards across OEM platforms (NVIDIA, Cisco, Dell, Supermicro, and others), GPU servers (NVIDIA, AMD, XPUs), and high‑speed networking (InfiniBand/RoCE).
Standardize operations across liquid‑ and air‑cooled environments, colocation sites, and modular data centers.
Establish global processes for provisioning, monitoring, maintenance, incident response, and lifecycle management.

Hardware Lifecycle & Reliability

Build and manage the full hardware lifecycle from deployment through retirement, leveraging outsourced resources for remote site operations.
Develop scalable processes for diagnostics, RMA coordination, spare‑parts forecasting, and reliability engineering.
Define and track fleet SLOs/SLAs including availability, MTTR, MTBF, and utilization.

Remote Operations & NOC Leadership

Build and lead a 24×7 global remote operations organization.
Develop a remote‑first model to manage distributed clusters.
Implement standardized runbooks, escalation paths, and observability across hardware, performance, power, cooling, and environmental telemetry.

Software & Platform Maintenance

Partner with Platform/DevOps teams to maintain cluster software stacks (Kubernetes, Slurm, Kubeflow).
Oversee GPU drivers, firmware, CUDA stack, and configuration automation.
Own patching, upgrades, change management, and low‑impact maintenance practices.
Manage platform layers operating above Kubernetes, including agent infrastructure.

AI‑Driven Operations (AIOps)

Lead adoption of AI/ML for predictive failure detection, anomaly detection, alert triage, and automated remediation.
Build toward an autonomous, self‑healing GPU fleet through data‑driven automation.

Vendor & Field Coordination

Manage OEM and repair vendor relationships and enforce SLAs.
Coordinate global field technicians and remote hands support.

Capacity & Customer Operations

Partner with Customer Success and Capacity Planning teams to ensure GPU availability and performance.
Support large‑scale deployments, escalations, and on‑premises customer installations.

Team Leadership & Scaling

Hire and lead teams across hardware operations, reliability engineering, NOC, and automation engineering.
Establish KPIs, dashboards, and operational reporting to support rapid growth.

Basic Qualifications

10+ years of experience in infrastructure, data center, or cloud operations.
5+ years managing distributed hardware fleets or large‑scale compute environments.
Experience operating GPU, HPC, or high‑performance compute clusters.
Proven experience leading 24×7 operations teams.
Strong technical understanding of:
- GPU servers and accelerator infrastructure
- High‑speed networking (Infini…

