
System Engineer, GPU Fleet

Job in San Francisco, San Francisco County, California, 94199, USA
Listing for: Fluidstack
Full Time position
Listed on 2026-02-07
Job specializations:
  • IT/Tech: Systems Engineer, Cloud Computing
Salary/Wage Range: $200,000 - $300,000 USD per year
Job Description

Overview

About Fluidstack:
At Fluidstack, we’re building the infrastructure for abundant intelligence. We partner with top AI labs, governments, and enterprises to unlock compute at the speed of light. We’re working with urgency to make AGI a reality and are looking for motivated individuals who are committed to delivering world-class infrastructure.

If you’re motivated by purpose, obsessed with excellence, and ready to work very hard to accelerate the future of intelligence, join us in building what's next.

Role

As a System Engineer, GPU Fleet, you will manage, operate, and optimize hyperscale GPU compute infrastructure supporting AI/ML training and inference workloads. You will ensure high availability, performance, and reliability of the GPU server fleet through automation, monitoring, troubleshooting, and collaboration with hardware engineering, platform teams, and datacenter operations.

Responsibilities
  • Operate and maintain a large-scale GPU server fleet (H100, B200, GB200) supporting AI/ML workloads; monitor system health, performance, and utilization to maximize uptime and ensure SLA compliance.
  • Perform hands-on troubleshooting and root cause analysis of complex hardware, firmware, OS, and application issues across GPU clusters; coordinate with vendors and hardware teams to resolve systemic failures.
  • Develop and maintain automation scripts for provisioning, configuration management, monitoring, and remediation at scale.
  • Build and improve tooling for GPU health checks, performance diagnostics, driver validation, and automated recovery (an illustrative health-check sketch follows this list).
  • Execute server provisioning, configuration, firmware updates, and OS installation using automation frameworks; manage lifecycle operations including deployment, maintenance, and decommissioning.
  • Participate in 24x7 on-call rotation; respond to production incidents and coordinate resolution with cross-functional teams including datacenter operations, network engineering, and application teams.
  • Lead post-incident reviews, document root causes, and drive continuous improvement initiatives focused on automation, reliability, monitoring, and operational efficiency.
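
As a purely illustrative sketch of the kind of health-check tooling described above, the script below polls GPUs with nvidia-smi and flags any that run hot. The query fields and CSV output format are standard nvidia-smi options; the temperature threshold, exit-code convention, and "what to do with a failing node" are placeholder assumptions, not a description of Fluidstack's actual tooling.

```python
#!/usr/bin/env python3
"""Minimal GPU health-check sketch (illustrative only).

Assumes nvidia-smi is installed on the node. Thresholds and the
handling of unhealthy GPUs are placeholders.
"""
import subprocess
import sys

# Fields requested through nvidia-smi's documented --query-gpu interface.
QUERY_FIELDS = "index,name,temperature.gpu,utilization.gpu,memory.used,memory.total"
TEMP_LIMIT_C = 85  # placeholder threshold for flagging a hot GPU


def read_gpus():
    """Return one dict per GPU parsed from nvidia-smi CSV output."""
    out = subprocess.run(
        ["nvidia-smi",
         f"--query-gpu={QUERY_FIELDS}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    gpus = []
    for line in out.strip().splitlines():
        # Simple split is fine here because none of the queried fields
        # contain commas on typical datacenter GPUs.
        idx, name, temp, util, mem_used, mem_total = [f.strip() for f in line.split(",")]
        gpus.append({
            "index": int(idx),
            "name": name,
            "temp_c": int(temp),
            "util_pct": int(util),
            "mem_used_mib": int(mem_used),
            "mem_total_mib": int(mem_total),
        })
    return gpus


def main():
    unhealthy = [g for g in read_gpus() if g["temp_c"] >= TEMP_LIMIT_C]
    for g in unhealthy:
        print(f"GPU {g['index']} ({g['name']}) at {g['temp_c']}C", file=sys.stderr)
    # A real fleet tool would drain the node or open an incident here;
    # this sketch just exits nonzero so a monitor or scheduler can react.
    sys.exit(1 if unhealthy else 0)


if __name__ == "__main__":
    main()
```

In practice a check like this would run on every node (for example from a cron job or a node-exporter style agent) and feed automated remediation rather than printing to stderr.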
Basic Qualifications
  • Bachelor's degree in Computer Science, Engineering, or related technical field (or equivalent practical experience).
  • 3+ years (System Engineer) or 5+ years (Senior System Engineer) in Linux system administration, datacenter operations, or infrastructure engineering.
  • Strong Linux/Unix fundamentals, including system administration, scripting (Bash, Python), troubleshooting, and performance tuning.
  • Experience with server hardware architecture and troubleshooting techniques, plus a solid understanding of compute, memory, storage, and networking components.
  • Experience with automation and configuration management tools (Ansible, Puppet, Chef, Terraform).
  • Strong analytical and problem-solving skills with ability to diagnose complex technical issues under pressure.
  • Excellent communication and collaboration skills; ability to work effectively with cross-functional teams.
Preferred Qualifications
  • Experience managing large-scale GPU infrastructure (NVIDIA H100, A100, B200, GB200) in production environments supporting AI/ML workloads.
  • Deep knowledge of GPU architecture, the CUDA toolkit, GPU drivers, and monitoring tools (nvidia-smi, DCGM).
  • Experience with HPC cluster management, job schedulers (Slurm, PBS, LSF), and container orchestration (Kubernetes, Docker).
  • Proficiency with out-of-band management (IPMI, Redfish, BMCs) and firmware management for server hardware (see the Redfish sketch after this list).
  • Experience with high-performance networking (InfiniBand, RoCE, RDMA) and network troubleshooting in GPU cluster environments.
  • Familiarity with datacenter operations including rack installations, cabling, power management, and thermal considerations.
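
To give a flavor of the out-of-band management work referenced above, the sketch below walks a BMC's standard DMTF Redfish endpoints and reports each system's power state. The BMC address, credentials, and TLS handling are placeholder assumptions; a real fleet tool would integrate this with inventory, credential management, and incident workflows.

```python
#!/usr/bin/env python3
"""Minimal Redfish out-of-band check sketch (illustrative only).

Assumes a standard DMTF Redfish service on the BMC. Host and
credentials below are placeholders.
"""
import requests
import urllib3

BMC_HOST = "10.0.0.1"          # placeholder BMC address
AUTH = ("admin", "password")   # placeholder credentials
BASE = f"https://{BMC_HOST}/redfish/v1"

# Many BMCs ship self-signed certificates; a production tool would pin
# or install the proper CA instead of disabling verification.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)


def system_power_states():
    """Yield (system id, PowerState) for every system the BMC exposes."""
    session = requests.Session()
    session.auth = AUTH
    session.verify = False

    # /redfish/v1/Systems is the standard ComputerSystemCollection.
    systems = session.get(f"{BASE}/Systems").json()
    for member in systems.get("Members", []):
        body = session.get(f"https://{BMC_HOST}{member['@odata.id']}").json()
        yield body.get("Id", member["@odata.id"]), body.get("PowerState")


if __name__ == "__main__":
    for sys_id, state in system_power_states():
        print(f"{sys_id}: {state}")
```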
Salary & Benefits
  • Competitive total compensation package (salary + equity).
  • Retirement or pension plan, in line with local norms.
  • Health, dental, and vision insurance.
  • Generous PTO policy, in line with local norms.

The base salary range for this position is $200,000 - $300,000 per year, depending on experience, skills, qualifications, and location. This range represents our good faith estimate of the compensation for this role…
