Technical Operations Manager - AI Job, Caerphilly area, Wales, UK, IT/Tech

Overview

Era4 develops, owns and operates AI infrastructure across the UK, powered by renewable energy. Converting legacy industrial and energy sites into modern data‑centre facilities, Era4 is combining brownfield regeneration opportunities with cleaner, efficient, scalable compute capacity for healthcare, research, finance, enterprise, and public‑sector organisations.

The Technical Operations Manager is responsible for the implementation and day‑to‑day running of a new, greenfield Technical Operations Centre, encompassing Client Support, SRE / AIOps and Automation. Sitting across the traditional Service Desk and SRE / AI Engineering functions, this new role ensures that Era4’s sovereign AI/HPC infrastructure is supported, monitored and delivered to contracted SLA targets from day one. The best part is that it is yours to build.

You will design and shape the function, embed SLO‑driven thinking and agentic approaches, own escalation pathways, and translate complex infrastructure events into clear customer communications. This is a foundational role with a direct line to leadership and genuine scope to shape how the function operates at scale.

Key Responsibilities

Own the end‑to‑end operational performance of the 24x7 Operations Centre: incident management, change management and problem management.
Serve as the primary escalation point for P1/P2 incidents, providing incident command and coordinating resolution across SRE, Service Desk, DevOps partners and third‑party vendors.
Maintain and continuously improve operational runbooks, SOPs and the post‑mortem‑to‑runbook learning pipeline.
Lead regular operational reviews (weekly, monthly) and produce management‑ready performance reports covering MTTR, SLA adherence, error budget consumption and incident trends.
Manage on‑call rotas and escalation schedules across the Operations Centre; coordinate overnight cover and hand‑off procedures.
Own the change advisory process, ensuring all infrastructure changes are risk‑assessed, scheduled and communicated appropriately.
Line‑manage the Service Desk function: own ticket triage workflows, SLA timers, first‑contact resolution targets and customer communication standards within the ITSM.
Champion a customer‑first culture across Service Desk.

SRE / AIOps

Own the SLO/error budget framework at a programme level: hold the team accountable to error budget targets, use burn‑rate data to drive prioritisation decisions and escalate when automation investment needs to be throttled or accelerated.
Provide operational context in sprint planning and backlog prioritisation; ensure the SRE team’s roadmap is anchored to customer experience, customer‑impacting risk reduction and compliance milestones, not engineering preference alone.
Manage and develop third‑party integrations, at both a service and a technical level.

Required Experience & Skills

Comfortable and confident dealing directly with clients, from technical support tickets to service reviews with senior leadership.
Proven background within infrastructure operations, HPC, SRE, NOC, managed services or equivalent mission‑critical environment in a management or senior lead role.
Demonstrated experience across at least two of the three domains: NOC/incident operations, service management (ITSM, SLA governance) and SRE/platform engineering, with sufficient working knowledge of the third to operate effectively as an escalation point.
Working knowledge of observability tooling (Grafana, Prometheus or equivalent); able to read dashboards, interrogate alert logic and hold meaningful conversations with engineering teams and third‑party vendors.
Fluency with SLA/SLO frameworks: designing, implementing and reporting against contractual and internal service targets.
Strong Linux, container and infrastructure knowledge, specifically supporting GPU and HPC workloads in production.

One or More Would Be An Advantage

Operational experience with GPU infrastructure (NVIDIA HGX, DGX, InfiniBand) or AI/HPC compute environments.
Familiarity with DCGM Exporter, GPU telemetry or equivalent high‑density compute monitoring.
Experience with integration and automation into ticketing platforms (Halo, ServiceNow, Freshservice or equivalent) and ITIL‑based incident, problem and change management.
Hands‑on experience with GitLab, GitOps workflows and infrastructure‑as‑code (Terraform, Ansible or AWX).
Exposure to agentic remediation / AIOps tooling, automated alerting, event correlation or self‑healing runbooks.
Exposure to one or more of Python, Go, Bash, PromQL.
Experience in a data centre, hosting, cloud, colocation, managed hosting or sovereign cloud environment.

Why Join Era4

You’ll be joining a mission‑driven start‑up building critical national infrastructure, where operational excellence directly enables growth. This role offers high visibility with leadership, real autonomy and the chance to shape how a next‑generation company operates at scale.
