Senior AI Platform Engineer — L2/L3 Operations; VMs & OpenShift
Listed on 2026-01-01
IT/Tech
Systems Engineer, Cloud Computing
Senior AI Platform Engineer – L2/L3 Operations (VMs & OpenShift)
Direct message the job poster from Astek Middle East.
We are seeking a Senior AI Platform Resident Engineer to lead L2/L3 operations, reliability, and production readiness for enterprise AI platform components deployed across virtual machines and OpenShift environments.
This role is highly operational and hands‑on, focused on stability, observability, scalability, and security of AI runtime services including model inference, vector databases, messaging, and conversational platforms. You will play a key role in closing operational gaps, defining runbooks, and ensuring reliable service delivery in a restricted, on‑premises environment.
- Operate and support LLM inference services (e.g., vLLM) across VMs and OpenShift
- Support Qdrant (vector search), Kafka, and Rasa in production environments
- Implement performance tuning, scaling strategies, security hardening, and observability
- Develop L2 operational runbooks and define clear L3/vendor escalation paths
- Manage Kafka and Redis clusters with high availability
- Perform tuning, capacity planning, backup/restore, and failure recovery
- Monitor throughput, latency, and resource utilization
- Deploy, manage, and harden services on VM‑based platforms and OpenShift clusters
- Apply RBAC, TLS, audit logging, resource quotas, autoscaling, and health checks
- Support CI/CD rollouts and standardize deployment and release processes
- Build and maintain metrics, logs, alerts, dashboards, and SLO/SLA monitoring
- Lead incident response, root cause analysis (RCA), and post‑incident reviews
- Execute disaster recovery (DR) testing and resilience validation
- Identify L2 capability gaps and deliver structured operational training
- Define SLOs, RPO/RTO, escalation workflows, and production readiness checklists
- Improve documentation and operational maturity across teams
- PostgreSQL and MongoDB are out of L2 scope and handled by other teams
- 7+ years operating distributed systems in production environments
- 3+ years hands‑on experience with OpenShift and/or Kubernetes
- Strong expertise in Linux, networking, observability, and security hardening
- Experience supporting Kafka, Redis, Qdrant, Rasa, or LLM inference frameworks
- Proven experience in L2/L3 support, incident management, and escalation handling
Location: Riyadh, Saudi Arabia.
Seniority level: Mid‑Senior.
Employment type: Full‑time.
Job function: Information Technology and Engineering.