Site Reliability Engineer Job, Toronto area, Ontario, Canada, IT/Tech

Title: Site Reliability Engineer

Location: Toronto, Ontario

Duration: 12 months

Pay range: C49 INC

Years of Experience: 6-8

We are seeking a Site Reliability Engineer (SRE) to ensure the reliability, scalability, and performance of platform services. The ideal candidate will bring strong expertise in SRE practices, observability, infrastructure automation, and developer platform enablement, with exposure to modern technologies including policy-as-code and emerging GenAI-driven systems.

Key Responsibilities

Implement and manage SRE practices including:
- Incident management, root cause analysis, and postmortems
- Reliability engineering and performance optimization
- Tracking and improving DORA metrics
Define and monitor Service Level Indicators (SLIs) and Service Level Objectives (SLOs)
Build and manage monitoring, logging, and distributed tracing frameworks
Ensure platform reliability through proactive alerting, observability, and automation
Automate infrastructure and governance using:
- Terraform (Infrastructure as Code)
- Policy-as-Code tools (OPA/Rego, Sentinel)
Enhance developer experience and productivity by:
- Designing self-service platform capabilities
- Managing service catalogs and platform standards
- Building reusable templates and golden paths
Work with tools like Backstage to enable internal developer platforms
Collaborate with engineering teams to improve system stability, deployment reliability, and operational efficiency
Support integration and reliability considerations for GenAI-based systems (RAG, prompt workflows, model evaluation)

Required Skills

Strong experience in SRE practices and reliability engineering
Hands‑on expertise with monitoring/logging platforms and distributed tracing
Experience with SLO/SLI frameworks and observability design
Experience in incident management and performance engineering
Strong understanding of DORA metrics and operational excellence
Proficiency in Terraform (Infrastructure as Code)
Experience with Policy-as-Code tools (OPA/Rego, Sentinel)
Experience with developer platform tools (Backstage, service catalogs)
Experience building golden paths and driving platform standardization

Nice to Have

Exposure to GenAI platforms, RAG, and prompt engineering concepts
Experience in developer productivity measurement and platform engineering initiatives

Tools & Methodologies

Experience with Agile methodologies and supporting tools (Jira, Confluence)
Familiarity with DevOps and platform engineering practices

Soft Skills

Strong problem‑solving and analytical skills
Ability to work in high‑pressure production environments
Excellent communication and cross‑team collaboration
