Lead Engineer - Site Reliability
Join Inception to apply for the Lead Engineer - Site Reliability role.
Overview
We are looking for a hands‑on Lead Site Reliability Engineer to own the reliability, observability, and automation of our Azure and hybrid (Azure Stack / on‑prem) platforms. You will lead SRE practices for our AI, data, and application services, drive a cloud‑agnostic DevSecOps toolchain, and partner with engineering, data, and security teams to ensure our platforms are secure, scalable, and cost‑efficient.
This role is ideal for a senior engineer with 10+ years of experience who can combine deep technical expertise with strong leadership and coaching skills.
Inception, a G42 company, is the region’s leading innovator of AI‑powered domain‑specific as well as industry‑agnostic products, built on a rich heritage of research and development. Within the G42 ecosystem, Inception functions as the core intelligence layer – transforming data and compute infrastructure into real‑world, applied AI solutions. Beyond its commercial endeavors, Inception is committed to creating positive societal impact.
Responsibilities
- Own SLOs/SLIs and overall reliability for key Azure and on‑prem platforms (data, AI/ML, and business‑critical applications).
- Plan and optimise capacity, performance, and cost for compute, storage, networking, and GPU/accelerator workloads.
- Build and maintain observability (metrics, logs, traces, dashboards, alerts) using Azure Monitor, Log Analytics, Application Insights, Prometheus, Grafana, and central log platforms.
- Lead automation of infrastructure and operations using Terraform, Bicep, Ansible, and scripting (Python, PowerShell, Bash/Go); drive self‑healing and runbook‑driven operations.
- Operate Azure, Azure Stack, and on‑prem Kubernetes/AKS clusters; ensure secure, resilient hybrid connectivity, identity, and access across environments.
- Lead P0/P1 incident response, on‑call rotations, communication, and blameless post‑mortems; drive long‑term fixes and reliability improvements.
- Use ITSM and DevSecOps tools (e.g., cloud‑agnostic CI/CD, ServiceNow, Jira, ManageEngine, security scanning and policy‑as‑code) to manage change, incidents, and compliance.
- Provide technical leadership and mentoring to SREs and platform engineers; collaborate with data, AI/ML, application, and security teams to design for reliability and security from day one.
Qualifications
- 10+ years in SRE/DevOps/platform engineering roles, including 5+ years designing and running workloads on Microsoft Azure at scale.
- Strong experience with Azure Data and AI services, including Azure Synapse Analytics, Azure Data Factory, Azure Databricks, Azure Data Lake, Azure Machine Learning, Azure OpenAI Service, and Azure Cognitive Services.
- Deep hands‑on skills with containers and Kubernetes (AKS or equivalent), including autoscaling, upgrades, and production operations.
- Proficiency with Infrastructure‑as‑Code (Terraform, Bicep, Ansible) and scripting/programming in Python and/or PowerShell (Go/Bash a plus).
- Solid understanding of observability practices and tools (metrics, logs, traces) and experience implementing monitoring and alerting in production.
- Proven track record implementing SRE practices (SLOs/SLIs, error budgets, capacity planning, cost/performance optimisation).
- Familiarity with hybrid networking, identity, and security (ExpressRoute/VPN, private endpoints, Azure AD, key management).
- Experience working within Agile/Scrum and ITIL processes; exposure to ISO 27001 and external audits is an advantage.
- Excellent communication and stakeholder management skills, with a proven ability to lead, mentor, and influence cross‑functional teams.
Success Measures
- 99.9%+ availability for core platforms and customer‑facing services.
- Fast and predictable incident handling (MTTD).
- End‑to‑end observability with meaningful, low‑noise alerting across Azure and on‑prem environments.
- Significant reduction in manual toil through automation and self‑service (target ~50% reduction over time).
- Documented and tested DR/BCP for key AI, data, and application platforms.
If you are a performance‑driven,…