The role is for an HPC Engineer responsible for designing, deploying, managing, and optimizing an on-premises High Performance Computing (HPC) environment.
The environment includes SLURM-managed CPU and GPU clusters.
Strong emphasis on HPC architecture, Linux administration, job scheduling, and cluster operations.
Experience with parallel/distributed storage (WekaFS, Scality) is preferred but optional.
Primary Skills:
HPC Operations & Cluster Management (CPU & GPU)
SLURM Workload Manager (Mandatory): install/configure/manage SLURM across multiple clusters
Partitions/queues, fairshare, job priority, scheduling policies
Upgrades, migrations, automation via API/integrations
Linux System Administration (RHEL focus): OS patching, hardening, tuning, package management
Troubleshooting & Performance Optimization: cluster health, node/job failures, bottlenecks, utilization optimization
Parallel Computing Knowledge: MPI, OpenMP, distributed execution fundamentals
Secondary Skills (Preferred / Optional):
Storage / Parallel File Systems
WekaFS (preferred)
Scality RING / ARTESCA (preferred)
GPU Computing Exposure: NVIDIA drivers, CUDA familiarity, GPU scheduling concepts
Monitoring Tools: Grafana, Prometheus
Automation / Scripting: Bash/Python for workflows, tooling, ops automation
HPC Ecosystem Components: InfiniBand/100G networking, monitoring tools, storage tiering concepts
SLURM-based HPC clusters
Linux (RHEL) administration
Multi-node distributed systems
(Optional) Storage platforms like WekaFS / Scality
Role and Responsibilities:
A. Key Responsibilities
1) HPC Infrastructure & Operations
Manage day-to-day operations of on-prem CPU & GPU clusters
Monitor health, performance, utilization; ensure availability & efficiency
Implement best practices for:
HPC operations
user management
resource administration
Troubleshoot:
networking issues
node failures
job failures
performance bottlenecks
User support:
job submissions
resource usage
HPC workflows
2) SLURM Workload Manager (Mandatory)
Configure/install/manage SLURM across multiple clusters
Manage:
queues
partitions
node allocation policies
fairshare policies
job prioritization
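To give a feel for what fairshare and job-prioritization tuning involves: SLURM's multifactor priority plugin computes a job's priority as a weighted sum of normalized factors. The sketch below illustrates that calculation in Python; the weight values are illustrative examples, not SLURM defaults.

```python
# Sketch of SLURM's multifactor job priority calculation.
# Each factor is normalized to [0.0, 1.0] and multiplied by a
# site-configured weight (the values below are illustrative, not defaults).

WEIGHTS = {
    "age": 1000,         # PriorityWeightAge
    "fairshare": 10000,  # PriorityWeightFairshare
    "jobsize": 500,      # PriorityWeightJobSize
    "partition": 1000,   # PriorityWeightPartition
    "qos": 2000,         # PriorityWeightQOS
}

def job_priority(factors: dict) -> int:
    """Weighted sum of normalized priority factors, as in priority/multifactor."""
    return int(sum(WEIGHTS[name] * factors.get(name, 0.0) for name in WEIGHTS))

# A user with depleted fairshare is out-prioritized by one with healthy fairshare:
starved = job_priority({"age": 1.0, "fairshare": 0.1, "qos": 0.5})
fresh   = job_priority({"age": 0.2, "fairshare": 0.9, "qos": 0.5})
```

Raising `PriorityWeightFairshare` relative to the other weights is the usual lever for keeping heavy users from starving everyone else.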
Handle:
SLURM upgrades
migrations
maintenance activities
Work with SLURM APIs/integrations for:
automation
custom workflows
Optimize scheduling for mixed CPU/GPU workloads
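Configuring SLURM for mixed CPU/GPU workloads typically means separate partitions with GPUs exposed as generic resources (GRES). A minimal slurm.conf sketch, assuming hypothetical node names, counts, and weights:

```ini
# Illustrative slurm.conf excerpt (node ranges, partition names, and
# weights are hypothetical, not taken from this posting)
PriorityType=priority/multifactor
PriorityWeightFairshare=10000
PriorityWeightAge=1000

# GPUs exposed as generic resources; separate CPU and GPU partitions
GresTypes=gpu
NodeName=gpu[01-08] Gres=gpu:4 CPUs=64 RealMemory=512000 State=UNKNOWN
NodeName=cpu[01-32] CPUs=128 RealMemory=256000 State=UNKNOWN
PartitionName=cpu Nodes=cpu[01-32] Default=YES MaxTime=72:00:00 State=UP
PartitionName=gpu Nodes=gpu[01-08] Default=NO MaxTime=24:00:00 State=UP
```

Keeping GPU nodes in their own partition prevents CPU-only jobs from occupying (and idling) expensive accelerator nodes.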
3) Linux System Administration
Administer:
compute nodes
head nodes
admin servers
Perform:
OS updates
package installs
security patching
system tuning
Automate via:
shell scripting (Bash/Python)
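As an example of the kind of Bash/Python ops automation meant here, the sketch below parses `sinfo -h -N -o "%N %t"`-style output (node name plus state) and flags unhealthy nodes; the sample output and state list are illustrative.

```python
# Sketch of a node-health helper: parse `sinfo -h -N -o "%N %t"` output
# (node name + state per line) and flag nodes in known-bad SLURM states.
# The sample below is illustrative output, not from a real cluster.

def unhealthy_nodes(sinfo_output: str,
                    bad_states=("down", "drain", "drng", "fail")) -> list:
    """Return node names whose SLURM state matches a known-bad state."""
    flagged = []
    for line in sinfo_output.strip().splitlines():
        node, state = line.split()
        # sinfo may suffix states with flags like '*' (non-responding)
        if state.rstrip("*~#!%").lower() in bad_states:
            flagged.append(node)
    return flagged

sample = """\
cpu01 alloc
cpu02 idle
cpu03 down*
gpu01 drain
"""
```

In production this would wrap `subprocess.check_output(["sinfo", "-h", "-N", "-o", "%N %t"])` and feed the flagged list into alerting or automated drain/reboot tooling.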
4) Parallel Computing & Cluster Architecture
Understand and support workloads using:
MPI
OpenMP
distributed execution
Work with HPC building blocks:
high-speed interconnects (InfiniBand/100G)
storage tiers
resource managers
monitoring tools
Diagnose and resolve:
parallel workload performance issues
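A standard first step when diagnosing parallel workload performance is checking measured scaling against Amdahl's law, which bounds the speedup achievable when only a fraction of the work parallelizes. A minimal sketch:

```python
# Amdahl's law: upper bound on speedup when a fraction p of the work
# parallelizes perfectly across n workers (MPI ranks / OpenMP threads).

def amdahl_speedup(p: float, n: int) -> float:
    """Ideal speedup for parallel fraction p on n workers."""
    return 1.0 / ((1.0 - p) + p / n)

def efficiency(p: float, n: int) -> float:
    """Speedup per worker; falling efficiency signals a scaling limit."""
    return amdahl_speedup(p, n) / n

# Even a 95%-parallel code gets well under 64x on 64 workers,
# so measured speedups far below this bound point at other bottlenecks
# (interconnect, storage, load imbalance) rather than the serial fraction.
s = amdahl_speedup(0.95, 64)
```

Comparing a job's observed speedup curve against this bound helps separate algorithmic serial-fraction limits from infrastructure issues like slow interconnects or storage contention.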
B. Additional Responsibilities (Optional / Preferred Area)
5) Storage (Optional but Preferred)
A. WEKA (WekaFS)
Manage/tune parallel file system performance
Troubleshoot WekaFS issues with minimal downtime
Provide internal guidance and usage best practices
Track ecosystem improvements & recommend enhancements
B. Scality
Maintain and troubleshoot:
Scality RING
ARTESCA environments
Monitor/tune for high availability & reliability
Create documentation (configuration + SOPs)
Recommend performance improvements based on product enhancements