
AI Infrastructure Engineer

Job in South Portland, Cumberland County, Maine, 04106, USA
Listing for: Hydra Host
Full Time position
Listed on 2026-07-02
Job specializations:
  • Software Development
    DevOps, Cloud Engineer - Software

About Hydra Host

Hydra Host is a Founders Fund–backed NVIDIA cloud partner building the infrastructure platform that powers AI. We connect AI Factories (high-performance GPU data centers) with the teams that depend on them: research labs training foundation models, enterprises running production inference, and developer platforms demanding scalable compute capacity. We operate where hardware meets software, at the bare metal layer where reliability, performance, and speed matter most.

The Role

AI platform companies need more than raw GPU capacity. They need bare metal that's ready for their stack—Kubernetes clusters configured for multi-node inference, NVIDIA drivers tuned for their workloads, SLURM environments that work out of the box. Today, getting there requires white-glove onboarding. Your job is to change that.

As an AI Infrastructure Engineer, you'll work directly with AI platform customers to get their infrastructure running on Hydra. You'll also work with platform partners (e.g., Northflank, Rafay, vCluster) to build reference deployments that provision through our API. You'll learn what breaks, what's missing, and what's harder than it should be, then work with Product and Engineering to turn those learnings into capabilities that ship in Brokkr, our core product.

The goal is to move from bespoke deployments to fully automated onboarding at scale.

What You'll Do
  • Get AI Platform customers production-ready on Hydra: standing up Kubernetes clusters, configuring GPU drivers, validating networking, and troubleshooting the issues that surface when real workloads hit real hardware.
  • Own the bare metal ↔ platform layer: bridging GPU infrastructure (NCCL, InfiniBand, NVLink, storage) with orchestration layers (Kubernetes, SLURM) and the MLOps tooling that customers actually use.
  • Configure, benchmark, and debug NVIDIA driver stacks: firmware versions, CUDA compatibility, NCCL tuning, MIG configurations. Run quality benchmarks and diagnostics to validate performance for inference and training workloads across chip types.
  • Identify gaps before customers do: pressure-testing Hydra's infrastructure, APIs, and workflows to find what's missing or broken.
  • Turn customer learnings into product: working with Product and Engineering to build reusable templates, default configurations, and automated workflows that eliminate manual onboarding.
  • Advise customers on chip selection and tokenomics: helping AI platform customers understand price/performance trade-offs across GPU types, cost-per-token economics, and which hardware fits their inference or training workloads.
What We're Looking For
Required
  • Bare metal Linux depth: you've administered GPU servers at the metal, including driver stacks, kernel tuning, firmware, and storage configuration. Not just managed K8s.
  • NVIDIA GPU stack expertise: drivers, CUDA, NCCL, NVLink, nvidia-smi profiling. You understand how stack compatibility affects performance.
  • Kubernetes and orchestration: production experience with K8s, SLURM, or similar. You know how to stand up clusters, not just deploy to them.
  • AI networking fundamentals: TCP/IP, VLANs, bonding, and high-speed interconnects (InfiniBand, RoCE) for distributed workloads.
  • Customer-facing communication: you can work directly with engineers at AI platform companies, understand their constraints, and translate that into clear requirements for your team.
  • Bias toward scalable solutions: you'd rather build a feature that helps 10 customers than a custom deployment that helps 1.
Nice to Have
  • HPC or large‑scale distributed training environments
  • AI workload experience (vLLM, PyTorch, inference frameworks)
  • Storage systems (NVMe, distributed file systems, Ceph, WEKA)
  • IaC and provisioning tools (Terraform, Ansible, cloud-init, MAAS)
Why This Role

You’ll work at the seam between bare metal and the orchestration layers AI teams actually use. You’ll define what it means to be production‑ready for our AI Platform customers. Every customer engagement sharpens your understanding of what needs to be built—and you’ll have direct influence over Hydra’s product roadmap to make it happen.

Why Hydra Host
  • Competitive salary — we pay fairly and transparently
  • Equity ownership — meaningful stake in what we’re building
  • Healthcare — medical, dental, vision for you and your family
  • Remote‑first — with hubs in Phoenix, Boulder, and Miami
  • Direct impact — your work shapes how GPU infrastructure gets deployed across the AI ecosystem