Infrastructure/Cluster Engineer

Job in San Francisco, San Francisco County, California, 94199, USA
Listing for: Linuxcareers
Full Time position
Listed on 2026-06-15
Job specializations:
  • IT/Tech
    Systems Engineer, IT Infrastructure, SRE/Site Reliability, Cloud Computing: Infrastructure & Operations
Salary/Wage Range or Industry Benchmark: 120,000 - 160,000 USD per year

Gimlet is building AI infrastructure and orchestration platforms for large-scale AI datacenters. This Infrastructure/Cluster Engineer role involves designing, building, and operating heterogeneous cluster infrastructure that intelligently routes workloads across diverse hardware architectures to enable production AI inference at scale.

What You'll Do
  • Design, deploy, and operate large-scale CPU, GPU, and accelerator clusters powering production AI inference workloads
  • Build automation for provisioning, configuration, upgrades, validation, and lifecycle management across heterogeneous bare-metal infrastructure
  • Debug complex production issues spanning Linux, networking, storage, drivers, firmware, and orchestration layers
  • Build and operate high-performance networking infrastructure including RDMA-enabled environments and accelerator interconnects
  • Design and scale observability systems for cluster health, capacity, performance, failures, and workload behavior
What You Need
  • Experience in infrastructure, cluster engineering, platform engineering, SRE, HPC, or distributed systems
  • Deep Linux systems experience including debugging performance, networking, storage, processes, and kernel-level issues
  • Experience operating Kubernetes, Slurm, Nomad, or similar orchestration and scheduling systems
  • Strong automation skills using Terraform, Ansible, Helm, Python, Go, or equivalent
  • Experience with GPU or accelerator infrastructure including drivers, firmware, CUDA/ROCm stacks, or hardware validation
Nice to Have
  • Experience building or operating AI inference, training, HPC, or neocloud infrastructure
  • Experience with bare-metal provisioning, PXE/iPXE, image pipelines, BIOS/firmware management, or rack bring-up
  • Experience with multi-tenant cluster isolation, quota systems, fair scheduling, or usage accounting
  • Experience building observability platforms using Prometheus, OpenTelemetry, Grafana, or similar technologies