
Lead Engineer - Site Reliability

Job in Abu Dhabi, UAE/Dubai
Listing for: Inception
Full Time position
Listed on 2026-01-01
Job specializations:
  • IT/Tech
    Cloud Computing, Systems Engineer
Salary/Wage Range or Industry Benchmark: 200,000 - 300,000 AED yearly

Overview

We are looking for a hands‑on Lead Site Reliability Engineer to own the reliability, observability, and automation of our Azure and hybrid (Azure Stack / on‑prem) platforms. You will lead SRE practices for our AI, data, and application services, drive a cloud‑agnostic DevSecOps toolchain, and partner with engineering, data, and security teams to ensure our platforms are secure, scalable, and cost‑efficient.

This role is ideal for a senior engineer with 10+ years of experience who can combine deep technical expertise with strong leadership and coaching skills.

Inception, a G42 company, is the region’s leading innovator of AI‑powered domain‑specific as well as industry‑agnostic products, built on a rich heritage of research and development. Within the G42 ecosystem, Inception functions as the core intelligence layer – transforming data and compute infrastructure into real‑world, applied AI solutions. Beyond its commercial endeavors, Inception is committed to creating positive societal impact.

Responsibilities
  • Own SLOs/SLIs and overall reliability for key Azure and on‑prem platforms (data, AI/ML, and business‑critical applications).
  • Plan and optimise capacity, performance, and cost for compute, storage, networking, and GPU/accelerator workloads.
  • Build and maintain observability (metrics, logs, traces, dashboards, alerts) using Azure Monitor, Log Analytics, Application Insights, Prometheus, Grafana, and central log platforms.
  • Lead automation of infrastructure and operations using Terraform, Bicep, Ansible, and scripting (Python, PowerShell, Bash/Go); drive self‑healing and runbook‑driven operations.
  • Operate Azure, Azure Stack, and on‑prem Kubernetes/AKS clusters; ensure secure, resilient hybrid connectivity, identity, and access across environments.
  • Lead P0/P1 incident response, on‑call rotations, communication, and blameless post‑mortems; drive long‑term fixes and reliability improvements.
  • Use ITSM and DevSecOps tools (e.g., cloud‑agnostic CI/CD, ServiceNow, Jira, ManageEngine, security scanning and policy‑as‑code) to manage change, incidents, and compliance.
  • Provide technical leadership and mentoring to SREs and platform engineers; collaborate with data, AI/ML, application, and security teams to design for reliability and security from day one.
Qualifications, Skills & Experience
  • 10+ years in SRE/DevOps/platform engineering roles, including 5+ years designing and running workloads on Microsoft Azure at scale.
  • Strong experience with Azure Data and AI services, including Azure Synapse Analytics, Azure Data Factory, Azure Databricks, Azure Data Lake, Azure Machine Learning, Azure OpenAI Service, and Azure Cognitive Services.
  • Deep hands‑on skills with containers and Kubernetes (AKS or equivalent), including autoscaling, upgrades, and production operations.
  • Proficiency with Infrastructure‑as‑Code (Terraform, Bicep, Ansible) and scripting/programming in Python and/or PowerShell (Go/Bash a plus).
  • Solid understanding of observability practices and tools (metrics, logs, traces) and experience implementing monitoring and alerting in production.
  • Proven track record implementing SRE practices (SLOs/SLIs, error budgets, capacity planning, cost/performance optimisation).
  • Familiarity with hybrid networking, identity, and security (ExpressRoute/VPN, private endpoints, Azure AD, key management).
  • Experience working within Agile/Scrum and ITIL processes; exposure to ISO 27001 and external audits is an advantage.
  • Excellent communication and stakeholder management skills, with a proven ability to lead, mentor, and influence cross‑functional teams.
What Success Looks Like
  • 99.9%+ availability for core platforms and customer‑facing services.
  • Fast and predictable incident handling (MTTD).
  • End‑to‑end observability with meaningful, low‑noise alerting across Azure and on‑prem environments.
  • Significant reduction in manual toil through automation and self‑service (target ~50% reduction over time).
  • Documented and tested DR/BCP for key AI, data, and application platforms.
What We Look For

If you are a performance‑driven,…
