Vice President, Site Reliability Engineering: London area, Greater London, England, UK (IT/Tech)

Location: Greater London

Vice President – Site Reliability Engineer – London

BNY is seeking a Vice President – Site Reliability Engineer to design, build, deploy, and scale resilient, automated, and centrally managed engineering solutions for Production Services. This role is ideal for a strong full‑stack engineer who combines application development, UI engineering, backend services, infrastructure automation, and production reliability expertise.

Responsibilities

Design, develop, and deploy centralized engineering solutions that improve operational efficiency, reduce toil, and enhance resiliency across Production Services.
Build full-stack applications and internal engineering tools, including backend services, APIs, automation layers, and user-facing interfaces using technologies such as Python, Java, React, or Angular.
Engineer scalable solutions that support central operational use cases such as self‑service tooling, operational dashboards, alert enrichment, incident reduction, service recovery, and workflow automation.
Develop reusable frameworks and components that can be adopted broadly across Production Services teams to standardize and accelerate operational processes.
Automate infrastructure, deployment, configuration, and runtime support activities using tools such as Ansible and Kubernetes.
Define, implement, and continuously improve Service Level Indicators, Service Level Objectives, and service health measures aligned to operational and business priorities.
Build and optimize monitoring, observability, and alerting capabilities using tools such as Prometheus, Grafana, AppDynamics, and Splunk.
Apply AIOps capabilities to improve event correlation, anomaly detection, root cause analysis, predictive insights, and proactive issue prevention.
Partner with engineering, infrastructure, production support, security, and risk teams to ensure developed solutions are secure, scalable, supportable, and aligned to enterprise standards.
Identify manual, fragmented, or repetitive processes across Production Services and convert them into efficient, automated, centrally consumable solutions.

Required Qualifications

Bachelor’s degree in Computer Science, Engineering, or a related technical discipline, or equivalent practical experience.
Strong full‑stack development experience, with hands‑on expertise in Python and Java for backend or service‑layer engineering.
Strong working knowledge of front‑end development using React or Angular, including building interfaces for operational or engineering use cases.
Proven experience designing and deploying end‑to‑end solutions, from application development through production deployment and operational support.
Experience in Site Reliability Engineering, Production Engineering, DevOps, Platform Engineering, or similar roles supporting business‑critical applications.
Strong foundation in Linux/Unix systems administration, scripting, troubleshooting, and infrastructure concepts.
Hands‑on experience with Ansible and Kubernetes in enterprise or production environments.
Demonstrated ability to define and operationalize SLIs, SLOs, dashboards, alerts, and health indicators.
Hands‑on experience with enterprise monitoring and observability platforms including Prometheus, Grafana, AppDynamics, and Splunk.
Strong troubleshooting, analytical, and problem‑solving skills in complex distributed or production environments.
Strong verbal and written communication skills, with the ability to collaborate effectively across technical and non‑technical stakeholders.

Preferred Qualifications

Experience building centralized internal platforms or shared engineering services for operational or enterprise users.
Experience applying AIOps, machine learning, or intelligent automation within production support or reliability engineering environments.
Exposure to CI/CD pipelines, infrastructure as code, API‑driven automation, and modern software delivery practices.
Experience supporting distributed systems, cloud‑native platforms, or container‑based architectures.
Knowledge of Agile, DevOps, and SRE operating models, including continuous improvement and blameless post‑incident practices.
Ability to influence engineering standards and drive adoption of common tooling and automation patterns across teams.
