Compute Technical Consultant, Onsite; LANL Los Alamos, NM Job Santa Fe area, New Mexico USA, IT/Tech

Position: High Performance Compute Technical Consultant, Onsite (LANL) Los Alamos, NM

This role has been designed as 'Onsite' with an expectation that you will primarily work from an HPE partner/customer office.

Key Responsibilities

Monitor and maintain system health across large-scale HPC compute, network, and storage infrastructure
Troubleshoot and repair hardware issues on HPC servers and supporting systems
Perform basic Linux system administration tasks as needed
Create, monitor, update, and close support tickets
Perform hardware component replacements using spares
Operate hand tools and low‑power tools for server maintenance
Track and document hardware repairs, part replacements, and returns
Create, update, and maintain site documentation, processes, and workflows
Assist with new system installation and expansion activities
Read system documentation and diagrams to locate components
Collaborate with team members using email, Teams, Slack, and in‑person communication
Participate in an on‑call schedule to support 24x7 operations
Maintain tools and workspace in an organized manner

Minimum Qualifications

Ability to obtain a Q Clearance (required)
US Citizenship (required)
Must be able to work onsite 5 days per week in Los Alamos, NM, with additional onsite work for on‑call support. This is not a remote position.
Strong mechanical aptitude and comfort using common hand tools (screwdrivers, pliers, wrenches, cable tools, etc.) for assembling, disassembling, and maintaining server hardware and related equipment
Ability to lift up to 50 lbs individually and up to 75 lbs with assistance
Solid understanding of computer hardware components (servers, drives, memory modules, power supplies, cabling, and peripherals)
Proficiency with basic computer operations on Windows and macOS (MacBook), including OS navigation, file management, and standard productivity tools such as Slack, SharePoint, and Microsoft Office (Word, Excel, Outlook, and Teams)

Preferred Qualifications

Associate's degree, some college, or technical training (BS preferred)
2+ years of Linux system administration experience, including strong command‑line navigation, log analysis and monitoring (journalctl, syslog, log files), troubleshooting system and application issues, and scripting/automation using Bash or Python.
Experience using Redfish (along with IPMI) for out‑of‑band server hardware management and monitoring. This includes utilizing the Redfish RESTful API for querying system health, power/thermal monitoring, firmware inventory, component status (processors, memory, drives, NICs), event logs, and performing actions such as system resets, power control, and BIOS configuration.
2+ years of hands‑on experience troubleshooting and maintaining server hardware in a datacenter environment, including diagnosing hardware faults (power, thermal, storage, networking), performing component replacements (drives, memory, CPUs, PSUs, HBAs, NICs), rack mounting/decommissioning servers, and managing cable infrastructure
1+ year of experience with high‑speed networking concepts and troubleshooting for Ethernet, HPE Slingshot, and InfiniBand fabrics, including link diagnostics, performance tuning, cable/fiber management, switch configuration, and fault isolation in large‑scale HPC environments.
Previous experience in a 24x7 production support environment
Strong troubleshooting and problem‑solving skills with the ability to work independently, including systematically diagnosing complex hardware, software, and network issues through log analysis, debugging tools, and root cause analysis while minimizing downtime in high‑availability environments
Experience reading technical diagrams and schematics, and working with ticketing systems
Experience with Git for version control of code, scripts, configuration files, and documentation (including cloning, branching, committing, merging, and resolving conflicts)
Experience with High‑Performance Computing (HPC) systems, clusters, or large‑scale AI infrastructure
Experience with large‑scale storage systems, including installation, configuration, monitoring, and troubleshooting of parallel file systems, enterprise SAN/NAS solutions, object storage, and…