
Senior SRE: AI Cloud & Kubernetes Platform Lead

Job in Abu Dhabi, UAE/Dubai
Listing for: ConnectsBlue
Full Time position
Listed on 2026-06-03
Job specializations:
  • IT/Tech
    Systems Engineer, Cloud Computing, SRE/Site Reliability, IT Support
Salary/Wage Range or Industry Benchmark: 200,000 - 300,000 AED yearly
Job Description & How to Apply Below
# Senior SRE: AI Cloud & Kubernetes Platform Lead
Core42

United Arab Emirates. Posted 5/30/2026

Job Description

Senior Site Reliability Engineer, Core42 - Abu Dhabi, UAE

About Us

Core42, a leader in AI-powered cloud and digital infrastructure, is driving transformative technology solutions globally.

Leveraging advanced resources and partnerships, Core42 empowers clients to harness sovereign AI infrastructure, especially in sectors with stringent regulatory needs.

With a mission to redefine digital transformation, we combine sovereign capabilities with scalable, high-performance compute infrastructure, positioning ourselves at the forefront of AI innovation in the Middle East and beyond.

The Opportunity

As a Senior Site Reliability Engineer, you will be responsible for designing, implementing, and operating scalable, reliable, and secure infrastructure to support large-scale AI and HPC workloads.

You will play a key role in building and maintaining CI/CD pipelines, Kubernetes-based environments, and observability systems that ensure high availability and performance across globally distributed platforms.

Working closely with engineering, product, and operations teams, you will drive automation, enforce SRE best practices, and contribute to a resilient and efficient infrastructure ecosystem that supports mission-critical applications.

Your Key Responsibilities

CI/CD & Automation:

Design, build, and maintain robust CI/CD pipelines using tools such as GitLab CI, Azure DevOps, and/or Jenkins to enable rapid and secure software delivery

Kubernetes Operations:

Operate, manage, and optimize Kubernetes clusters, ensuring scalability, performance, and resilience

Infrastructure as Code:

Develop and maintain infrastructure using Terraform, Helm, Ansible, or similar tools to automate provisioning and configuration

Observability & Monitoring:

Implement and manage monitoring solutions using Prometheus, VictoriaMetrics, Grafana, and ELK/EFK to ensure system health and performance

Incident Management:

Lead root cause analysis (RCA), post-mortems, and continuous improvement initiatives to enhance system reliability

Reliability Engineering:

Define and implement SRE best practices, including SLAs, SLOs, and error budgets

Logging & Alerting:

Build and maintain logging, alerting, and tracing systems for proactive issue detection and rapid troubleshooting

Security & Compliance:

Enforce security best practices and compliance standards across CI/CD pipelines and runtime environments; support audit readiness

Collaboration:

Work cross-functionally with engineering, product, and infrastructure teams to align platform capabilities with business needs

Mentorship:

Provide guidance and mentorship to junior engineers and contribute to knowledge sharing across teams

On-call Support:

Participate in on-call rotations to support critical platform services

What we're looking for

Required Skills/Qualifications

Bachelor's or Master's degree in Computer Science, Engineering, or a related technical field

5+ years of experience in DevOps, Site Reliability Engineering, or platform engineering roles in production environments

Proven experience managing Kubernetes clusters (e.g., GKE, EKS, AKS, or self-managed)
Hands-on experience with CI/CD tools and automation frameworks

Strong experience with infrastructure-as-code tools such as Terraform, Helm, or Ansible

Proficiency in container technologies (Docker, containerd) and orchestration with Kubernetes

Experience with observability and monitoring stacks (Prometheus, Grafana, ELK/EFK)
Solid understanding of Linux systems, networking concepts, and cloud-native security best practices

Preferred Skills/Qualifications

Experience supporting AI/ML or HPC workloads in production environments

Knowledge of GPU resource management, workload schedulers, and performance tuning

Familiarity with distributed systems and large-scale infrastructure environments

Experience with incident management frameworks and reliability engineering practices

Strong collaboration and communication skills across cross-functional teams
Position Requirements
10+ Years work experience