
Site Reliability Engineer (SRE)

Remote / Online - Candidates ideally in
Toronto, Ontario, C6A, Canada
Listing for: Deltatre
Remote/Work from Home position
Listed on 2026-02-15
Job specializations:
  • IT/Tech
    SRE/Site Reliability, Systems Engineer
Job Description & How to Apply Below
Position: Site Reliability Engineer (SRE)
The Site Reliability Engineer (SRE) is responsible for improving the reliability, stability, and operational readiness of critical digital platforms. The role focuses on proactively reducing risk, strengthening system resilience, and enabling product and engineering teams to operate with confidence—particularly during live events, launches, and other high-traffic periods. This role is dedicated to a major downtown Toronto-based client.

The role requires a degree of flexibility to support live operations onsite (in the client’s operations center) and regular on‑call support during evening and weekend live event windows and other key periods. If these requirements lead to work beyond 44 hours per week, overtime will be paid.

Outside of these event‑driven windows, the role supports flexible and remote working arrangements, provided there is some consistent onsite presence.

The SREs will be operating, monitoring, and enhancing the Deltatre OTT platform, which is designed to withstand millions of concurrent users using cutting‑edge technologies. On a daily basis, the SREs will be innovating, automating, maintaining, and securing our cloud‑based platform. SREs will collaborate with other engineering teams, service owners, and support teams to ensure services are highly available and performant.

Key Responsibilities

Improve system availability, performance, and fault tolerance across production environments.

Define, measure, and track Service Level Objectives (SLOs), error budgets, and reliability metrics.

Identify systemic risks and lead initiatives to reduce operational fragility.

Lead or support incident response for high‑severity production issues, particularly during evenings, weekends, and live operations as required.

Establish and refine incident response processes, runbooks, and escalation paths, ensuring that the B2B and Incident Management teams are informed of and trained on the procedures.

Conduct post‑incident reviews (blameless retrospectives) and ensure follow‑up actions are completed.

Observability & Tooling

Design and maintain monitoring, alerting, and logging strategies that prioritize actionable signals over noise.

Improve visibility into system health to enable faster detection and resolution of issues.

Partner with engineering teams to embed reliability considerations into system design.

Automation & Operational Efficiency

Reduce manual operational effort through automation, tooling, and improved deployment practices.

Improve deployment safety, rollback mechanisms, and change management processes.

Support capacity planning and performance testing.

Requirements
We’re looking for a persistent, hands‑on problem solver who takes ownership from first alert through to permanent resolution. You’ll have practical experience across most of the components in our technology stack and be comfortable operating in live, high‑availability environments.

Core technical experience includes:

Cloud platforms such as AWS and/or Azure

Containerized workloads using Docker and OCI‑compliant containers

MongoDB (including monitoring and operating in production) and Redis

CI/CD pipelines using tools such as Bamboo, GitHub, and Octopus

Scripting and automation with PowerShell and/or Bash

Observability and monitoring platforms such as New Relic and Datadog

Infrastructure as Code using Terraform and/or CloudFormation

Strong ability to read, understand, and debug .NET / C# applications (a significant advantage, as our backend services are written in C#)

Experience developing or supporting highly scalable, distributed systems

Hands‑on experience with microservices architectures, leveraging virtualization and/or containerization

Full‑stack troubleshooting capability, spanning network, application, infrastructure, and distributed services layers

Familiarity with load and performance testing tools such as k6, Gatling, or JMeter

We’re looking for someone who is:

driven to push boundaries and lead change and performance

communicative, so that no one is left in the dark and you work successfully with your team

reliable, so we know that we can call on you to meet deadlines

passionate about the latest technologies and…