Senior Site Reliability Engineering Specialist Job London area,Southwestern Ontario Ontario Canada,IT/Tech

Location: Southwestern Ontario

Senior Site Reliability Engineering Specialist

Join to apply for the Senior Site Reliability Engineering Specialist role at SAP.

Hybrid Work Arrangement

This is a hybrid role based out of Waterloo. Hybrid is 3 days a week onsite and 2 days a week remote.

We help the world run better
At SAP, we keep it simple: you bring your best to us, and we'll bring out the best in you. We're builders touching over 20 industries and 80% of global commerce, and we need your unique talents to help shape what's next. The work is challenging – but it matters. You'll find a place where you can be yourself, prioritize your wellbeing, and truly belong.

What's in it for you? Constant learning, skill growth, great benefits, and a team that wants you to grow and succeed.

Role Overview

As a Senior Site Reliability Engineer in Supply Chain Management (SCM) – Make & Deliver, you will ensure that SAP Digital Manufacturing and SAP Logistics Management operate reliably and efficiently. These solutions support critical manufacturing and logistics processes worldwide and are built on SAP BTP, Kubernetes, and multicloud environments. In this role, you act as an Enablement Advocate within the organization: partnering with development teams to review architecture for resiliency, enforce reliability guardrails, and integrate observability and performance standards into the design process.

Beyond operational excellence, you will also help develop and integrate AIOps tools for smarter monitoring and automated remediation, ensuring reliability is embedded across the lifecycle. You’ll contribute to incident response for high-severity events and drive automation that reduces complexity, enabling teams to deliver services that meet reliability goals by default.

What You’ll Do

Define and maintain SLIs/SLOs for critical services; apply error budgets to guide release decisions.
Collaborate with development teams to embed resiliency patterns and reliability guardrails into architecture and code.
Contribute to incident response for high-severity events; support root cause analysis and post-incident improvements.
Establish and evolve observability standards (logging, metrics, tracing) and build actionable dashboards and alerts.
Drive performance and scalability improvements through load testing, capacity planning, and CI/CD performance gates.
Automate operational tasks using Infrastructure-as-Code (Terraform/Helm), pipelines, and scripts to reduce toil.
Advance AIOps capabilities for anomaly detection, smarter alerting, and faster remediation.
Partner across teams to provide guidance, reviews, and golden paths for reliability by default.

TECH YOU’LL USE (DAY TO DAY)

Cloud & Platform: Kubernetes, Docker, SAP BTP, AWS/Azure services.
Automation & Development: CI/CD pipelines (GitHub Actions / Azure DevOps), Infrastructure as Code (Terraform/Helm), scripting, and integration into dev workflows.
Observability: Logging, metrics, and tracing tools; Dynatrace, Kibana/Elastic, Prometheus, OpenTelemetry.
Data & Messaging: Confluent Kafka, SAP HANA.
Performance Testing: Load and stress testing tools (e.g., JMeter, k6).
Languages: TypeScript, Python, Bash, Java.

What You’ll Bring

6-10+ years in SRE, DevOps, or production operations for distributed systems.
Proven experience with incident response and root cause analysis for high-severity events.
Strong skills in observability, performance engineering, and automation.
Hands-on expertise in Kubernetes cluster management and troubleshooting.
Ability to model load, run stress tests, analyze bottlenecks, and plan capacity.
Proficiency in CI/CD and Infrastructure as Code, with ability to influence development practices.
Excellent collaboration and communication skills to partner with development and product teams.

NICE TO HAVE

Familiarity with AIOps concepts (AI‑driven anomaly detection, predictive alerting, automated remediation).
Hands‑on experience with LLM agent frameworks (e.g., LangGraph or similar) for automation or reliability tooling.
Certifications in Kubernetes, SAP BTP, or Dynatrace.
Experience with the manufacturing domain.

EDUCATION & WORK STYLE

Bachelor’s degree in Computer Science, Engineering, or equivalent experience.
Curious,…

