Site Reliability Engineer - AI Infrastructure
Job in San Francisco, San Francisco County, California, 94199, USA
Listed on 2026-05-07
Listing for: Hamilton Barnes Associates Limited
Full Time position
Job specializations:
- IT/Tech
- Systems Engineer, Cloud Computing: Infrastructure & Operations, SRE/Site Reliability, Network Engineer
Job Description & How to Apply Below
Are you looking for an exciting new opportunity?
Join a seed-stage AI infrastructure company building large-scale training and inference platforms previously accessible only to hyperscalers. The business began with a single managed GPU cluster that reached capacity almost immediately and has since expanded into a global platform spanning infrastructure, networking, and orchestration.
Responsibilities
- Design, deploy, and maintain large-scale GPU clusters (H100/H200/B200) for training and inference workloads.
- Build automation pipelines for provisioning, scaling, and monitoring compute resources across Slurm and Kubernetes environments.
- Develop observability, alerting, and auto-healing systems for high-availability GPU workloads.
- Collaborate with ML, networking, and platform teams to optimize resource scheduling, GPU utilization, and data flow.
- Implement infrastructure-as-code, CI/CD pipelines, and reliability standards across thousands of nodes.
- Diagnose performance bottlenecks and drive continuous improvements in reliability, latency, and throughput.
Requirements
- 7+ years of experience in SRE, DevOps, or Infrastructure Engineering roles supporting large-scale compute environments.
- Strong hands‑on experience with Kubernetes and Slurm for cluster orchestration and workload management.
- Deep knowledge of Linux systems, networking, and GPU infrastructure (NVIDIA H100/H200/B200 preferred).
- Proficiency in Python, Go, or Bash for automation, tooling, and performance tuning.
- Experience with observability stacks (Prometheus, Grafana, Loki) and incident response frameworks.
- Familiarity with high-performance computing (HPC) or AI/ML training infrastructure at scale.
- Background in reliability engineering, distributed systems, or hardware acceleration environments is a strong plus.
Compensation and Benefits
- IPO Equity
- 10% company bonus
- 401(k) with 4% match
- $250,000 gross per year