IB CTO Team - Site Reliability Engineer (SRE) - Assistant Vice President, Cary, North Carolina, USA (IT/Tech)

Position: IB CTO Team - Site Reliability Engineer (SRE) - Assistant Vice President

Job Description:

Job Title: IB CTO Team - Site Reliability Engineer (SRE)

Corporate Title: Assistant Vice President

Location: Cary, NC

Who we are

In short – an essential part of Deutsche Bank’s technology solution, developing applications for key business areas.

Our Technologists drive Cloud, Cyber and business technology strategy while transforming it within a robust, hands‑on engineering culture. Learning is a key element of our people strategy, and we have a variety of options for you to develop professionally. Our approach to the future of work champions flexibility and is rooted in the understanding that there have been dramatic shifts in the ways we work.

Having first established a presence in the Americas in the 19th century, Deutsche Bank opened its US technology center in Cary, North Carolina in 2009.

Overview

We are looking for a Site Reliability Engineer (SRE) to join our global team. This role will focus on ensuring the operational health, reliability, performance, and scalability of the CARE platform and multi‑tenant applications, encompassing Google Cloud Platform (GCP) and on‑prem infrastructure, application deployment, and the underlying CARE services. You will be instrumental in defining and implementing SRE best practices to maintain a highly available and resilient platform.

As a senior IB SRE, you will be crucial in ensuring the continuous operation and improvement of the platform.

What We Offer You

A diverse and inclusive environment that embraces change, innovation, and collaboration
A hybrid working model, allowing for in‑office / work from home flexibility, generous vacation, personal and volunteer days
Employee Resource Groups support an inclusive workplace for everyone and promote community engagement
Competitive compensation packages including health and wellbeing benefits, retirement savings plans, parental leave, and family building benefits
Educational resources, matching gift and volunteer programs

What You’ll Do

Platform Reliability and Performance:
Proactively monitor, troubleshoot, and resolve issues related to platform availability, performance, and capacity on both GCP and on‑prem infrastructure
Operational Excellence:
Develop, implement, and maintain SRE best practices, including incident response, post‑mortems, root cause analysis, and proactive problem prevention
Automation and Tooling:
Drive automation efforts to reduce manual toil across operational tasks, deployment, scaling, and recovery. This includes developing and improving monitoring, alerting, and self‑healing systems
SLI/SLO Management:
Define, monitor, and report on Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for key platform services, working to continuously improve them
Collaboration and Support:
Liaise with application teams (tenants) to understand their operational needs, provide guidance on platform best practices for reliability and capacity planning, and assist with complex troubleshooting
Security and Compliance:
Collaborate with security teams to ensure the platform adheres to security policies and compliance requirements, focusing on operational security aspects

Skills You’ll Need

Strong understanding of SRE principles and practices, including SLOs/SLIs, incident management, post‑mortems, and toil reduction
Deep understanding of GCP services such as GKE, Identity and Access Management (IAM), identity services, Cloud SQL, Cloud Monitoring, Cloud Logging, and related operational aspects
Extensive experience with Kubernetes and container orchestration, including configuration, troubleshooting, and performance tuning
Experience with Service Mesh (e.g., Istio) is highly desirable
Experience with monitoring and alerting tools (e.g., Prometheus, Grafana, Splunk, Google Cloud Monitoring) and defining effective alerts and dashboards
Solid experience with Git and GitHub, including Git workflows for managing code, and with deployment tooling such as ArgoCD for deployments and managing application life cycles
Programming/scripting (e.g., Python, Go, Java, Bash) and Infrastructure as Code (e.g., Terraform) experience for automation, tooling development,…