HPC Engineer
Job in
Riyadh, Riyadh Region, Saudi Arabia
Listed on 2026-06-19
Listing for:
Penta Consulting
Full Time
position Listed on 2026-06-19
Job specializations:
-
IT/Tech
Systems Engineer, Cloud Computing: Infrastructure & Operations, IT Infrastructure
Job Description & How to Apply Below
Penta Consulting are a technology service provider and leading outsourced partner helping to deliver professional and managed solutions across EMEA.
We are seeking an experienced Senior Infrastructure HPC Engineer who has personally designed, deployed, configured, and operated every component of a large-scale high-performance computing environment.
Key Responsibilities- Design, deploy, and maintain HPC clusters end-to-end: compute nodes, storage tiers, high-speed networking (Infini Band / RoCE), and management fabric.
- Personally, provision and administer NVIDIA Base Command Manager (BCM) for bare-metal cluster imaging, OS lifecycle, and GPU fleet health monitoring.
- Deploy and manage the full NVIDIA AI Enterprise Suite: install, license, update, and integrate with MLOps pipelines (NeMo, Triton, RAPIDS).
- Deploy and operate NVIDIA GPU Operator and Network Operator on Kubernetes to automate driver and CUDA lifecycle, DCGM exporter, and MIG configuration.
- Configure and serve NVIDIA NIM inference endpoints; implement NVIDIA Blueprint reference architectures for production AI workloads.
- Install, administer, and tune Slurm: partitions, QOS, fair-share policies, node accounting, MPI integration, and Slurm-on-Kubernetes hybrid scheduling.
- Bootstrap and operate Kubernetes clusters using kubeadm - including control plane HA, etcd backup, and zero-downtime upgrades.
- Administer RHEL / Canonical Ubuntu across all cluster nodes.
- Build and maintain CI/CD pipelines (Git Lab CI / Git Hub Actions) for infrastructure provisioning and HPC software delivery.
- Profile and tune GPU and CPU workload performance; resolve bottlenecks across hardware, drivers, MPI fabric, and application layers.
- Implement cluster monitoring with Prometheus, Grafana, and DCGM; define alerting and capacity planning thresholds.
- Enforce security best practices: node hardening, kernel patching, RBAC, and compliance audits across the HPC environment.
To View & Apply for jobs on this site that accept applications from your location or country, tap the button below to make a Search.
(If this job is in fact in your jurisdiction, then you may be using a Proxy or VPN to access this site, and to progress further, you should change your connectivity to another mobile device or PC).
(If this job is in fact in your jurisdiction, then you may be using a Proxy or VPN to access this site, and to progress further, you should change your connectivity to another mobile device or PC).
Search for further Jobs Here:
×