Lead Engineer - Site Reliability Job: Abu Dhabi area, UAE / Dubai, IT/Tech

Lead Engineer - Site Reliability


Overview

We are looking for a hands‑on Lead Site Reliability Engineer to own the reliability, observability, and automation of our Azure and hybrid (Azure Stack / on‑prem) platforms. You will lead SRE practices for our AI, data, and application services, drive a cloud‑agnostic DevSecOps toolchain, and partner with engineering, data, and security teams to ensure our platforms are secure, scalable, and cost‑efficient.

This role is ideal for a senior engineer with 10+ years of experience who can combine deep technical expertise with strong leadership and coaching skills.

Inception, a G42 company, is the region’s leading innovator of AI‑powered domain‑specific as well as industry‑agnostic products, built on a rich heritage of research and development. Within the G42 ecosystem, Inception functions as the core intelligence layer – transforming data and compute infrastructure into real‑world, applied AI solutions. Beyond its commercial endeavors, Inception is committed to creating positive societal impact.

Responsibilities

Own SLOs/SLIs and overall reliability for key Azure and on‑prem platforms (data, AI/ML, and business‑critical applications).
Plan and optimise capacity, performance, and cost for compute, storage, networking, and GPU/accelerator workloads.
Build and maintain observability (metrics, logs, traces, dashboards, alerts) using Azure Monitor, Log Analytics, Application Insights, Prometheus, Grafana, and central log platforms.
Lead automation of infrastructure and operations using Terraform, Bicep, Ansible, and scripting (Python, PowerShell, Bash/Go); drive self‑healing and runbook‑driven operations.
Operate Azure, Azure Stack, and on‑prem Kubernetes/AKS clusters; ensure secure, resilient hybrid connectivity, identity, and access across environments.
Lead P0/P1 incident response, on‑call rotations, communication, and blameless post‑mortems; drive long‑term fixes and reliability improvements.
Use ITSM and DevSecOps tools (e.g., cloud‑agnostic CI/CD, ServiceNow, Jira, ManageEngine, security scanning, and policy‑as‑code) to manage change, incidents, and compliance.
Provide technical leadership and mentoring to SREs and platform engineers; collaborate with data, AI/ML, application, and security teams to design for reliability and security from day one.

Qualifications, Skills & Experience

10+ years in SRE/DevOps/platform engineering roles, including 5+ years designing and running workloads on Microsoft Azure at scale.
Strong experience with Azure Data and AI services, including Azure Synapse Analytics, Azure Data Factory, Azure Databricks, Azure Data Lake, Azure Machine Learning, Azure OpenAI Service, and Azure Cognitive Services.
Deep hands‑on skills with containers and Kubernetes (AKS or equivalent), including autoscaling, upgrades, and production operations.
Proficiency with Infrastructure‑as‑Code (Terraform, Bicep, Ansible) and scripting/programming in Python and/or PowerShell (Go/Bash a plus).
Solid understanding of observability practices and tools (metrics, logs, traces) and experience implementing monitoring and alerting in production.
Proven track record implementing SRE practices (SLOs/SLIs, error budgets, capacity planning, cost/performance optimisation).
Familiarity with hybrid networking, identity, and security (ExpressRoute/VPN, private endpoints, Azure AD, key management).
Experience working within Agile/Scrum and ITIL processes; exposure to ISO 27001 and external audits is an advantage.
Excellent communication and stakeholder management skills, with a proven ability to lead, mentor, and influence cross‑functional teams.

What Success Looks Like

99.9%+ availability for core platforms and customer‑facing services.
Fast and predictable incident handling, measured by mean time to detect (MTTD).
End‑to‑end observability with meaningful, low‑noise alerting across Azure and on‑prem environments.
Significant reduction in manual toil through automation and self‑service (target ~50% reduction over time).
Documented and tested DR/BCP for key AI, data, and application platforms.

What We Look For

If you are a performance‑driven,…

