Compute Platform Engineer — GPU Infrastructure - Rapidly growing AI start up

Position: Compute Platform Engineer — GPU Infrastructure - Rapidly growing AI start up
Location: New York

Our client is a frontier AI company building at the cutting edge of what is possible in artificial intelligence. They are well‑funded, talent‑dense, and moving with genuine urgency. They are not building on top of someone else’s foundation; they are building the foundation itself. The team is small by design but growing fast, and every engineer they hire has a direct line to the infrastructure decisions that matter.

They are already generating significant revenue with marquee enterprise and government clients.

The Role

This is a Compute Platform engineering role focused on the GPU infrastructure layer that powers large‑scale model training. You will not be inheriting someone else’s architecture and maintaining it. You will be shaping it, working alongside the training teams to co‑design fault tolerance, cluster health strategies, and remediation workflows that determine how reliably and efficiently the company trains its models.

What You Will Be Working On

Cluster health monitoring, automatic node remediation, and topology‑aware scheduling across large multi‑GPU fleets. GPU‑to‑GPU network performance tuning and debugging. High‑performance storage management across multiple data centres, including datasets and checkpointing at petabyte scale. Capacity planning and hardware preparation for next‑generation GPU deployments; Blackwell hardware is already in production.

What They Are Looking For

Strong systems‑level engineering experience with a focus on cluster‑wide behaviour rather than individual service reliability.
Hands‑on experience operating large GPU fleets, not just scheduling workloads on them, but understanding what happens at the hardware and network layer when things go wrong.
Experience operating and managing GPU clusters at scale (ideally 5,000+).
Familiarity with NCCL and GPU‑to‑GPU communication.
Experience with high‑performance storage products such as VAST or Lustre across multiple data centres.
Strong coding ability in Go, C++ or Python.
Kubernetes‑first mindset with the depth to operate below the abstraction when needed. Prior exposure to InfiniBand or bare‑metal GPU provisioning is a significant advantage.

What Is On Offer

Base salary up to $600,000 depending on level and experience. Equity packages starting in the millions, with long‑term upside tied directly to the company’s trajectory. Comprehensive benefits. The opportunity to join a real rocket ship at the perfect time to realize real wealth creation.

On‑site in London or New York. San Francisco will be considered, but the focus is on London and New York for now.
