Asset & Wealth Management, Senior Site Reliability Engineer, Executive Director (Singapore, IT/Tech)

Overview

We are seeking a seasoned Site Reliability Engineer who excels at incident response and management with a strong emphasis on escalation discipline and crisp, audience-appropriate communications.

You will partner closely with front‑office trading desks, engineering and SRE colleagues, and Application Business Operations (ABO) to enhance desk readiness, reduce manual workload through strategic automation and AI, and raise the bar on observability, capacity, and change quality across globally distributed systems. This role includes stewardship of cross‑region handoffs, governance of error budgets, and the establishment of clear SRE KPIs to demonstrate value and drive continuous improvement.

Key Responsibilities

Incident Command, Escalation, and Communications
- Act as Incident Commander for high‑severity events, ensuring timely escalation, resolver engagement, and transparent communications to technical and business stakeholders.
- Maintain consistent status updates, incident timelines, and customer/leadership communications; improve comms templates and runbooks for clarity and speed.
- Drive post‑incident reviews with a blameless, learning‑first approach; produce actionable remediation items with owners and due dates.
Cross‑Region Handoffs and Desk Readiness
- Own the cross‑region handoff procedure to ensure emerging issues are surfaced globally, with explicit ownership, clear next steps, and desk‑readiness checklists.
- Ensure shift notes, incident context, and risk hot‑spots are consistently captured, discoverable, and actioned.
ABO Partnership and Workload Reduction
- Partner closely with ABO to identify incident/issue trends and patterns; quantify impact and prioritize engineering fixes that remove manual workarounds.
- Provide visibility into ABO workload; escalate when engineering solutions that reduce toil need prioritization.
Strategic Automation and AI
- Apply engineering tenets to automate repetitive tasks, codify remediations, and implement self‑healing mechanisms; evaluate and responsibly adopt AI to improve triage, runbook execution, and anomaly detection.
- Track toil reduction and time saved; feed back into prioritization and capacity planning.
Observability, Monitoring, and Alert Quality
- Collaborate with developers to improve instrumentation, SLIs, dashboards, and actionable alerts aligned to firmwide standards and globally consistent tooling.
- Reduce alert noise and increase signal‑to‑noise ratio via better thresholds, aggregation, deduplication, and suppression; validate alert‑to‑action mapping with runbooks and ownership.
- Expand tracing, logging, and metrics coverage to speed detection, triage, and root cause isolation.
SLOs, Error Budgets, and Reliability Governance
- Define and steward SLOs and SLIs across services; implement and manage error budgets with clear policies influencing release velocity and risk acceptance.
- Facilitate data‑driven tradeoffs between feature delivery and reliability; regularly review budget burn with product and engineering.
Capacity Engineering and Scalability
- Drive capacity engineering standards; partner with teams on forecasting, scaling strategies, and reporting (leading indicators, saturation, headroom).
- Work with developers to automate capacity tests, limit management, and scaling actions; ensure predictable behavior under load and graceful degradation.
Change Quality and ORR Gatekeeping
- Oversee change quality across environments; reduce change‑related incidents through pre‑deployment checks, progressive delivery, and canaries.
- Serve as ORR (Operational Readiness Review) gatekeeper, validating observability, runbooks, on‑call readiness, rollback plans, and dependencies before go‑live.
Documentation, Runbooks, and Training
- Review and improve documentation freshness, clarity, and completeness; identify highly repeatable runbook steps and automate them.
- Train developers on SRE fundamentals: SLOs/SLIs, error budgets, incident roles, on‑call hygiene, and production‑readiness best practices.
KPIs and Reporting
- Establish, track, and publish SRE KPIs and OKRs to evidence value, including MTTD, MTTA, MTTR, incident frequency and severity distribution, change failure rate, error…
