Enterprise Systems Operations Manager
Listed on 2026-06-27
IT/Tech
SRE/Site Reliability, Systems Administrator
Location Overview
The Cboe office in London is situated in The Monument Building, a modern space that spans two floors and features a spacious outdoor balcony. Employees can enjoy stunning views of London’s iconic architectural landmarks from the balcony. The building provides convenient amenities such as bike storage and showers, and its prime location in the heart of London’s financial district ensures easy access to a variety of cafés, restaurants, and shops.
The building is directly adjacent to the historic Monument to the Great Fire of London and just across the street from Monument Underground Station, offering convenient transport links and quick access to the West End.
As the Enterprise Systems Operations Manager at Cboe, you lead the team responsible for the real‑time health of enterprise infrastructure. Your team monitors systems, responds to alerts and logs, owns incident response, and drives operational workflows that keep platforms running reliably across regions. The focus is on restoring performance quickly and preventing recurrence. This role sits within the Enterprise Infrastructure Services organization and reports to the Senior Manager of Enterprise Systems.
A core expectation is to lead the adoption and scaling of AI‑assisted operations, building Claude skills and agentic workflows that accelerate monitoring, triage, incident response, and artefact generation.
Responsibilities
- Own the monitoring and observability programme, covering alert coverage, log aggregation, dashboard health, and the full path from detection to resolution.
- Keep alerts accurate, actionable, and correctly routed; reduce noise, false positives, and alert fatigue.
- Oversee log management across enterprise platforms (Windows Server, VMware, Microsoft 365, Azure, data protection), ensuring logs are retained, searchable, and used in investigations.
- Build and scale Claude skills and agentic workflows for the team, spanning alert triage, log summarisation, incident classification, runbook lookup, status updates, drift detection, and Jira ticket creation, and document them for the broader Enterprise Systems team.
- Replace repetitive operational tasks with autonomous agents and measure the impact on response times.
- Own the on‑call programme: scheduling, rotation coverage, escalation paths, and after‑hours standards.
- Lead major incident response across Engineering, Security, and application teams to restore service quickly, then run post‑incident reviews to identify root causes.
- Automate routine work with PowerShell, Graph API, and infrastructure‑as‑code; maintain runbooks and self‑healing scripts for known failures.
- Govern change management: review change requests and coordinate maintenance windows to minimise disruption.
- Own documentation standardisation and clean‑up for Enterprise Systems: build an effective documentation standard and drive the team to align runbooks, procedures, and operational artefacts.
- Manage and develop the operations engineering team, coaching them to use Claude and other AI tools effectively.
- Own the Enterprise Systems support queue operations programme, including vulnerability management, SLA tracking, ticket routing and escalation, and leadership reporting.
- Manage vendor relationships for operational tooling, including ITSM, monitoring, log management, and backup/recovery.
- Ensure audit readiness and represent operations in compliance reviews and risk assessments.
Requirements
- 5+ years in IT or infrastructure operations, including 2+ years leading a team.
- Hands‑on experience with monitoring and observability platforms (Grafana, Loki, Dynatrace, Azure Monitor, or equivalent).
- Working knowledge of log management tooling and using logs actively in incident investigation.
- Solid grasp of ITIL or an equivalent service management framework.
- Experience with ITSM tooling (Jira Service Management, ServiceNow, or equivalent).
- Familiarity with Windows Server, Microsoft 365, VMware, and Azure.
- Experience running on‑call programmes and leading major incident response.
- Track record of improving operational metrics such as MTTR, SLA compliance, and alert noise.
- Hands‑on experience building Claude skills,…