HPC Systems Engineer
Listed on 2026-05-31
IT/Tech
Systems Engineer, Cloud Computing, IT Support
We’re looking for an HPC Systems Engineer to help power the compute infrastructure behind our R&D innovation! In this role, you’ll support and evolve a high‑performance Linux cluster used for physics modeling, simulation, algorithm development, and machine‑learning workloads, enabling hundreds of engineers to do their best work every day. You’ll play a key role in driving the reliability, performance, and scalability of a shared, mission‑critical HPC environment, partnering closely with infrastructure, DevOps, and application teams to keep the platform fast, resilient, and ready for the most demanding computational challenges!
Responsibilities
- HPC Platform Operations: Operate and maintain a large‑scale Linux‑based HPC cluster used for internal R&D workloads; manage compute nodes, login nodes, and supporting infrastructure in a multi‑tenant environment; monitor cluster health, performance, and capacity; respond to incidents and degradations.
- Scheduler & Workload Management: Configure, tune, and support HPC job schedulers (e.g., SLURM, LSF, PBS, or equivalent); assist users with job submission issues, resource requests, and queue optimization; help optimize scheduler policies to balance throughput, fairness, and utilization.
- Linux Systems Engineering: Install, configure, and maintain Linux operating systems across compute and service nodes; manage OS updates, kernel changes, drivers (including GPU drivers where applicable), and system hardening; troubleshoot complex Linux performance, networking, storage, and process‑level issues.
- Performance & Scaling: Support high‑throughput and parallel workloads across CPU and GPU resources; diagnose performance bottlenecks across compute, storage, network, and scheduler layers; assist with scaling activities such as node expansions, re‑provisioning, and hardware refreshes.
- Automation & Reliability: Use automation and configuration management tools to ensure consistency across the cluster; contribute to scripting and tooling for node provisioning, validation, and lifecycle management; participate in on‑call or escalation rotations as required to support a production R&D platform.
- Collaboration & User Support: Partner with internal engineering teams to understand workload requirements and usage patterns; provide guidance and best practices for running workloads efficiently on shared HPC systems; contribute to internal documentation and operational runbooks.
- Bachelor’s degree in Computer Science, Engineering, or equivalent practical experience.
- 3+ years of hands‑on Linux systems administration experience.
- Direct experience working with HPC or large‑scale compute environments.
- Practical experience with at least one HPC scheduler (SLURM, LSF, PBS, or similar).
- Strong Linux troubleshooting skills (processes, memory, I/O, networking, performance analysis).
- Comfort working in CLI‑driven, production infrastructure environments.
- Experience supporting GPU‑accelerated workloads (CUDA, drivers, GPU scheduling concepts).
- Familiarity with parallel computing or scientific/engineering workloads.
- Experience with cluster storage systems (e.g., Lustre, BeeGFS, NFS, or high‑performance NAS/SAN).
- Exposure to automation tools (Ansible, scripting, Infrastructure‑as‑Code concepts).
- Familiarity with containers in HPC contexts (Singularity / Apptainer, rootless containers).
- Experience supporting internal developer or research communities.
Base Pay Range: $ - $ annually. Primary location: USA-MI-Ann Arbor (KLA).
Benefits include medical, dental, vision, life, and other voluntary benefits; 401(k) including company matching; employee stock purchase program (ESPP); student debt assistance; tuition reimbursement; development and career growth programs; financial planning benefits; wellness benefits including an employee assistance program (EAP); paid time off and paid company holidays; and family care and bonding leave. Interns are eligible for some of the benefits listed.
Equal Opportunity Employer
KLA is proud to be an Equal Opportunity Employer. We will ensure that qualified individuals with disabilities are provided reasonable accommodation to participate in the job application or interview process, to perform essential job functions, and to receive other benefits and privileges of employment. Please contact us at or to request accommodation.