Site Reliability Engineer - AI Infrastructure

Job in San Francisco, San Francisco County, California, 94199, USA

Listing for: Hamilton Barnes Associates Limited

Full Time position
Listed on 2026-05-07

Job specializations:

IT/Tech
Systems Engineer, Cloud Computing: Infrastructure & Operations, SRE/Site Reliability, Network Engineer

Salary/Wage Range or Industry Benchmark: USD 250,000 per year

Are you looking for an exciting new opportunity?

Join a seed-stage AI infrastructure company building large-scale training and inference platforms previously accessible only to hyperscalers. The business began with a single managed GPU cluster that reached capacity almost immediately and has since expanded into a global platform spanning infrastructure, networking, and orchestration.

Responsibilities

Design, deploy, and maintain large-scale GPU clusters (H100/H200/B200) for training and inference workloads.
Build automation pipelines for provisioning, scaling, and monitoring compute resources across Slurm and Kubernetes environments.
Develop observability, alerting, and auto-healing systems for high-availability GPU workloads.
Collaborate with ML, networking, and platform teams to optimize resource scheduling, GPU utilization, and data flow.
Implement infrastructure-as-code, CI/CD pipelines, and reliability standards across thousands of nodes.
Diagnose performance bottlenecks and drive continuous improvements in reliability, latency, and throughput.

Skills / Must Have

7+ years of experience in SRE, DevOps, or Infrastructure Engineering roles supporting large-scale compute environments.
Strong hands‑on experience with Kubernetes and Slurm for cluster orchestration and workload management.
Deep knowledge of Linux systems, networking, and GPU infrastructure (NVIDIA H100/H200/B200 preferred).
Proficiency in Python, Go, or Bash for automation, tooling, and performance tuning.
Experience with observability stacks (Prometheus, Grafana, Loki) and incident response frameworks.
Familiarity with high-performance computing (HPC) or AI/ML training infrastructure at scale.
Background in reliability engineering, distributed systems, or hardware acceleration environments is a strong plus.

Benefits

IPO Equity
10% company bonus
401(k) with 4% match

Salary

$250,000 gross per year
