Site Reliability Engineer - AI Infrastructure

Job in San Francisco, San Francisco County, California, 94199, USA
Listing for: Hamilton Barnes Associates Limited
Full Time position
Listed on 2026-05-07
Job specializations:
  • IT/Tech
    Systems Engineer, Cloud Computing: Infrastructure & Operations, SRE/Site Reliability, Network Engineer
Salary/Wage Range or Industry Benchmark: USD 250,000 per year
Job Description

Are you looking for an exciting new opportunity?

Join a seed-stage AI infrastructure company building large-scale training and inference platforms previously accessible only to hyperscalers. The business began with a single managed GPU cluster that reached capacity almost immediately and has since expanded into a global platform spanning infrastructure, networking, and orchestration.

Responsibilities
  • Design, deploy, and maintain large-scale GPU clusters (H100/H200/B200) for training and inference workloads.
  • Build automation pipelines for provisioning, scaling, and monitoring compute resources across Slurm and Kubernetes environments.
  • Develop observability, alerting, and auto-healing systems for high-availability GPU workloads.
  • Collaborate with ML, networking, and platform teams to optimize resource scheduling, GPU utilization, and data flow.
  • Implement infrastructure-as-code, CI/CD pipelines, and reliability standards across thousands of nodes.
  • Diagnose performance bottlenecks and drive continuous improvements in reliability, latency, and throughput.
Skills / Must Have
  • 7+ years of experience in SRE, DevOps, or Infrastructure Engineering roles supporting large-scale compute environments.
  • Strong hands‑on experience with Kubernetes and Slurm for cluster orchestration and workload management.
  • Deep knowledge of Linux systems, networking, and GPU infrastructure (NVIDIA H100/H200/B200 preferred).
  • Proficiency in Python, Go, or Bash for automation, tooling, and performance tuning.
  • Experience with observability stacks (Prometheus, Grafana, Loki) and incident response frameworks.
  • Familiarity with high-performance computing (HPC) or AI/ML training infrastructure at scale.
  • Background in reliability engineering, distributed systems, or hardware acceleration environments is a strong plus.
Benefits
  • IPO Equity
  • 10% company bonus
  • 401(k) with 4% match
Salary
  • $250,000 gross per year