SRE API GW and Microservices Job, Abu Dhabi area, UAE/Dubai, IT/Tech

Position: SRE for API GW and Microservices

We're looking for a talented Site Reliability Engineer (SRE) to keep our systems running smoothly, reliably and at scale. Through smart automation, deep observability and a calm head in a crisis, you'll help us balance speed, compliance and stability, working alongside DevOps, Cloud, Quality Engineering and Product teams to drive continuous improvements in performance, security and resilience. You'll play a key role in enhancing reliability, accelerating delivery and ensuring seamless digital experiences for banking customers.

What You Will Be Doing

Define and implement SLIs, SLOs and error budgets for business-critical digital banking services
Build actionable observability (metrics, logs, traces, dashboards and alerts) using Dynatrace, Prometheus, Grafana and ELK while reducing alert fatigue
Leverage AI-driven insights and anomaly detection (with Dynatrace Davis AI or an equivalent AIOps platform) to proactively predict and resolve reliability issues before they impact customers
Lead incident management from on-call triage and root-cause analysis to blameless postmortems with actionable follow-ups
Improve deployment safety with robust rollout and rollback strategies, canary and blue-green deployments, and production readiness reviews
Support and optimize microservices-based architectures, ensuring service reliability, scalability and inter-service resilience
Conduct capacity planning, performance tuning and resilience testing, optimizing for both reliability and cost efficiency
Automate operational toil from runbooks and remediation scripts to proactive health checks and self-healing workflows
Collaborate with DevOps to embed reliability gates and validations into CI/CD pipelines such as GitHub Actions, Jenkins, GitLab CI/CD or Azure DevOps
Own and evolve the observability and AIOps stack driving intelligent automation and predictive alerting capabilities
Maintain high-quality documentation, playbooks and operational standards across environments
Ensure operational compliance and security alignment with internal controls and regulatory standards
Analyze system performance, availability and cost data to continually optimize operations
Provide reliability support and escalation guidance for critical production systems during major incidents

Qualifications

5+ years of experience in SRE or DevOps roles, building and managing large-scale, high-availability systems across banking, fintech, e-commerce, or other data-intensive digital ecosystems.
Bachelor’s degree in Computer Science or equivalent technical experience.
Strong experience with Linux environments and performance troubleshooting.
Proven expertise in Terraform and Infrastructure as Code (IaC) methodologies.
Proficiency with Kubernetes and container orchestration in microservices environments.
Hands‑on experience with AWS (preferred); exposure to Azure or GCP is an advantage.
Deep knowledge of Dynatrace (AIOps, Davis AI), Prometheus, Grafana, and the ELK stack.
Experience implementing AI/ML-driven reliability or automation solutions (AIOps, anomaly detection, predictive alerting).
Practical understanding of CI/CD pipelines (GitHub Actions, Jenkins, GitLab CI/CD or Azure DevOps).
Experience with Kafka, RabbitMQ, Redis, Aurora, and RDS databases.
Strong scripting or programming skills in Python, Bash, or Go.

The Ideal Candidate

Organized, structured, and meticulous in approach.
Experienced in cross-functional collaboration and working with distributed teams.
Strong analytical mindset with excellent troubleshooting skills for complex production systems.
Calm and composed communicator under pressure, capable of leading during high-impact incidents.
Proactive problem-solver who anticipates issues and drives preventive improvements.
Passionate about AI-driven automation, observability, and reliability engineering.
Continuously learning and keeping up to date with cloud-native, microservices, and SRE best practices.
A collaborative and adaptable team player who thrives in a fast-paced, regulated environment and is passionate about building reliable, scalable systems that empower digital banking innovation.
