Sr. Hardware Ops Specialist – Data Center GPU; Level 3
Listed on 2026-02-16
Engineering
Hardware Engineer, Systems Engineer
This role is 100% on-site. You will be embedded on the data-center floor and take part in an on-call rotation to meet aggressive SLAs. Expect occasional travel between campuses for special projects or new-site launches.
Our client operates hyperscale GPU campuses purpose-built for AI, LLM, and HPC workloads. With several new halls coming online this year, they are hiring six senior hands‑on hardware experts—four in Buffalo, one in Houston, and one in West Texas—to keep thousands of NVIDIA‑powered servers running 24 × 7. If you can diagnose any server failure by sight, sound, or smell, and you thrive on the buzz of live production floors, this is your chance to own the physical backbone of the AI revolution.
What You’ll Do
- Own the rack from rail to NIC. Rack, cable, power on, and burn in GPU servers, network switches, and storage nodes, logging every asset change in the CMDB.
- Be the first‑responder. Triage and resolve hardware or Layer 1/2 network incidents, escalating to remote engineering SMEs only for code‑level fixes.
- Swap anything. Replace DIMMs, GPUs, SSDs, PSUs, fans, and NICs—even if you have never seen the exact failure mode before—and validate fixes with diagnostic tools.
- Maintain uptime. Execute structured change windows, follow ESD and OSHA safety practices, and document each action for audit and compliance.
- Prevent failures before they happen. Run capacity checks, preventive maintenance, and inventory audits to ensure zero surprise outages.
- Rotate and collaborate. Work a 24 × 7 shift rotation that includes on‑call duty, coordinating with facilities, network, and vendor partners on expansions and retrofits.
Qualifications
- 3+ years in data‑center or large‑scale lab operations with direct GPU‑server troubleshooting (BIOS, BMC/IPMI, PXE, firmware flashing, etc.).
- Hands‑on experience with NVIDIA accelerators (H100, B200, A100, or similar).
- Solid Linux CLI skills and enough scripting ability to automate diagnostics or asset updates.
- Working knowledge of copper and fiber cabling, switches, and optics.
- Proven ability to resolve unfamiliar hardware faults independently.
- Ability to lift 50 lbs, climb ladders, and work safely in hot/cold aisles.
- Technical diploma or certifications in electrical, mechanical, or IT disciplines.
- Vendor‑management or project‑coordination experience in a hyperscale build‑out.
Benefits
- Competitive base salary, depending on experience and location.
- Equity participation in a fast‑growth AI‑infrastructure company.
- Comprehensive medical, dental, and vision coverage.
- Retirement plan with company match.
- Generous PTO plus paid holidays aligned with local norms.
- Professional development budget and clear technical‑leadership career path.
Blue Signal is an award‑winning executive search firm with a range of specialties. Our recruiters have a proven track record of placing top‑tier talent across industry verticals, with deep expertise in numerous professional services. Learn more at bit.ly/46Gs4yS