Infrastructure/Cluster Engineer
Job in San Francisco, San Francisco County, California, 94199, USA
Listed on 2026-06-15
Listing for: Linuxcareers
Full Time position
Job specializations:
- IT/Tech
Systems Engineer, IT Infrastructure, SRE/Site Reliability, Cloud Computing: Infrastructure & Operations
Job Description & How to Apply Below
Gimlet is building AI infrastructure and orchestration platforms for large-scale AI datacenters. This Infrastructure/Cluster Engineer role involves designing, building, and operating heterogeneous cluster infrastructure that intelligently routes workloads across diverse hardware architectures to enable production AI inference at scale.
What You'll Do
- Design, deploy, and operate large-scale CPU, GPU, and accelerator clusters powering production AI inference workloads
- Build automation for provisioning, configuration, upgrades, validation, and lifecycle management across heterogeneous bare-metal infrastructure
- Debug complex production issues spanning Linux, networking, storage, drivers, firmware, and orchestration layers
- Build and operate high-performance networking infrastructure including RDMA-enabled environments and accelerator interconnects
- Design and scale observability systems for cluster health, capacity, performance, failures, and workload behavior
Qualifications
- Experience in infrastructure, cluster engineering, platform engineering, SRE, HPC, or distributed systems
- Deep Linux systems experience including debugging performance, networking, storage, processes, and kernel-level issues
- Experience operating Kubernetes, Slurm, Nomad, or similar orchestration and scheduling systems
- Strong automation skills using Terraform, Ansible, Helm, Python, Go, or equivalent
- Experience with GPU or accelerator infrastructure including drivers, firmware, CUDA/ROCm stacks, or hardware validation
- Experience building or operating AI inference, training, HPC, or neocloud infrastructure
- Experience with bare-metal provisioning, PXE/iPXE, image pipelines, BIOS/firmware management, or rack bring-up
- Experience with multi-tenant cluster isolation, quota systems, fair scheduling, or usage accounting
- Experience building observability platforms using Prometheus, OpenTelemetry, Grafana, or similar technologies