HPC Engineer

Job in Riyadh, Riyadh Region, Saudi Arabia

Listing for: Penta Consulting

Full Time position
Listed on 2026-06-19

Job specializations:

IT/Tech
Systems Engineer, Cloud Computing: Infrastructure & Operations, IT Infrastructure

Salary/Wage Range or Industry Benchmark: 120,000 - 150,000 SAR yearly

Penta Consulting are a technology service provider and leading outsourced partner helping to deliver professional and managed solutions across EMEA.

We are seeking an experienced Senior Infrastructure HPC Engineer who has personally designed, deployed, configured, and operated every component of a large-scale high-performance computing environment.

Key Responsibilities

Design, deploy, and maintain HPC clusters end-to-end: compute nodes, storage tiers, high-speed networking (InfiniBand / RoCE), and management fabric.
Personally provision and administer NVIDIA Base Command Manager (BCM) for bare-metal cluster imaging, OS lifecycle, and GPU fleet health monitoring.
Deploy and manage the full NVIDIA AI Enterprise Suite: install, license, update, and integrate with MLOps pipelines (NeMo, Triton, RAPIDS).
Deploy and operate NVIDIA GPU Operator and Network Operator on Kubernetes to automate driver and CUDA lifecycle, DCGM exporter, and MIG configuration.
Configure and serve NVIDIA NIM inference endpoints; implement NVIDIA Blueprint reference architectures for production AI workloads.
Install, administer, and tune Slurm: partitions, QOS, fair-share policies, node accounting, MPI integration, and Slurm-on-Kubernetes hybrid scheduling.
Bootstrap and operate Kubernetes clusters using kubeadm, including control plane HA, etcd backup, and zero-downtime upgrades.
Administer RHEL / Canonical Ubuntu across all cluster nodes.
Build and maintain CI/CD pipelines (GitLab CI / GitHub Actions) for infrastructure provisioning and HPC software delivery.
Profile and tune GPU and CPU workload performance; resolve bottlenecks across hardware, drivers, MPI fabric, and application layers.
Implement cluster monitoring with Prometheus, Grafana, and DCGM; define alerting and capacity planning thresholds.
Enforce security best practices: node hardening, kernel patching, RBAC, and compliance audits across the HPC environment.
