DevOps Engineer
Join M Partners as a DevOps Engineer in the dynamic high‑tech landscape of Dubai, UAE. M Partners is the exclusive software partner to one of the world’s largest ODMs in the networking equipment space, developing network operating systems that power critical data centre and telecom routing and switching infrastructure. The company has recently launched an AI division focused on designing custom chips to accelerate inference and training workloads.
The company is building a true networking vendor and a thriving ecosystem for embedded systems and ASIC design talent across the MENA region.
Own the end‑to‑end design and operation of on‑premise infrastructure for AI and enterprise workloads—built as code, automated, observable, and secure. Architect and run Kubernetes clusters for training and inference, manage servers, networks, and core services, and enable developers with reliable CI/CD and platform tooling. Your work directly impacts AI velocity at scale.
Responsibilities
- Design and operate on‑prem infrastructure as code: author reusable Terraform/Ansible/Helm modules and build GitOps workflows (e.g., Argo CD) for repeatable, audited changes across environments.
- Build and run Kubernetes for AI: configure multi‑tenant GPU clusters (MIG/GPUDirect RDMA, NVIDIA device plugins/DCGM), scheduling/quotas, HPA/Cluster Autoscaler where applicable, and workload isolation.
- Administer servers, networks, and core services: OS lifecycle (Linux), identity/SSO (Keycloak/LDAP), secrets (Vault), DNS/DHCP/NTP, artifact registries, and internal package mirrors.
- Provide storage for AI pipelines: integrate and operate high‑bandwidth/low‑latency storage, tune for dataset staging and checkpointing patterns.
- Enable CI/CD: partner with developers to design fast, reproducible pipelines (GitLab CI/GitHub Actions), caching and runners on GPU/CPU nodes, artifact provenance (SBOM, SLSA).
- Collaborate with platform, ML, silicon, systems, security, application developers, and site ops to turn infrastructure into a product that accelerates the business.
Requirements
- 5+ years in DevOps/SRE/Platform Engineering with hands‑on ownership of on‑prem hardware.
- Proven experience operating Kubernetes in production (multi‑tenant RBAC, networking/CNI).
- Proficiency with IaC and automation (Terraform, Ansible, Helm; GitOps with Argo CD/Flux).
- Strong Linux administration, scripting (Bash/Python), and troubleshooting across compute, network, and storage stacks.
- CI/CD expertise (GitLab CI/GitHub Actions), container build security (SBOM, image signing).
- Solid networking fundamentals (L2/L3, routing, BGP, VLANs, EVPN/VXLAN, load balancing, TLS/mTLS).
- Experience implementing observability (Prometheus/Grafana, logs, tracing) and running incident response.
Nice to have
- GPU cluster operations for AI (NVIDIA drivers/operator, DCGM, MIG, GPUDirect RDMA, Slurm integration).
- Storage for data‑intensive workloads (Ceph, parallel file systems, NVMe‑oF) and performance tuning.
- Secrets/identity platforms (Vault, Keycloak/LDAP/SSO), policy‑as‑code (OPA/Gatekeeper, Kyverno).
- Security/compliance practices (CIS benchmarks, SLSA, supply‑chain scanning) and zero‑trust networking.
- Data centre experience (rack/stack, power/cooling basics) and remote site rollout automation.
- Familiarity with configuration management for network devices and API‑driven switches/routers.
What success looks like
- Reproducible environments: spin up identical dev/test stacks from Git in ≤30 minutes with audit trails for every change.
- Solid CI/CD for AI workflows: deterministic pipelines, cache‑efficient, median pipeline time down 30–50% with artifact provenance.
- Predictable GPU orchestration: fair‑share scheduling, quotas, isolation (MIG/namespace policies) keep queues short; cluster utilisation increases >20%.
- Lab‑to‑cluster continuity: versioned hardware bring‑up images, drivers, firmware promoted through pipelines; new boards/nodes join clusters with push‑button automation.
- Actionable observability: dashboards/alerts reflecting SLOs meaningful to researchers; MTTR 80% routine requests resolved via self‑service workflows.
Recruitment:
Referral program increases interview chances by 2×.
Travel & Visa:
The client can obtain work visas for Dubai and provides flights and visa support; accommodation is not provided. Salary is flexible depending on the candidate's profile.
Location:
Global Village, Dubai, United Arab Emirates.