Infrastructure/Cluster Engineer

Job in San Francisco, San Francisco County, California, 94199, USA

Listing for: Linuxcareers

Full Time position
Listed on 2026-06-15

Job specializations:

IT/Tech
Systems Engineer, IT Infrastructure, SRE/Site Reliability, Cloud Computing: Infrastructure & Operations

Salary/Wage Range or Industry Benchmark: 120,000 - 160,000 USD yearly

Position: Infrastructure/Cluster Engineer

Gimlet is building AI infrastructure and orchestration platforms for large-scale AI datacenters. This Infrastructure/Cluster Engineer role involves designing, building, and operating heterogeneous cluster infrastructure that intelligently routes workloads across diverse hardware architectures to enable production AI inference at scale.

What You'll Do

Design, deploy, and operate large-scale CPU, GPU, and accelerator clusters powering production AI inference workloads
Build automation for provisioning, configuration, upgrades, validation, and lifecycle management across heterogeneous bare-metal infrastructure
Debug complex production issues spanning Linux, networking, storage, drivers, firmware, and orchestration layers
Build and operate high-performance networking infrastructure including RDMA-enabled environments and accelerator interconnects
Design and scale observability systems for cluster health, capacity, performance, failures, and workload behavior

What You Need

Experience in infrastructure, cluster engineering, platform engineering, SRE, HPC, or distributed systems
Deep Linux systems experience including debugging performance, networking, storage, processes, and kernel-level issues
Experience operating Kubernetes, Slurm, Nomad, or similar orchestration and scheduling systems
Strong automation skills using Terraform, Ansible, Helm, Python, Go, or equivalent
Experience with GPU or accelerator infrastructure including drivers, firmware, CUDA/ROCm stacks, or hardware validation

Nice to Have

Experience building or operating AI inference, training, HPC, or neocloud infrastructure
Experience with bare-metal provisioning, PXE/iPXE, image pipelines, BIOS/firmware management, or rack bring-up
Experience with multi-tenant cluster isolation, quota systems, fair scheduling, or usage accounting
Experience building observability platforms using Prometheus, OpenTelemetry, Grafana, or similar technologies
