Site Reliability Engineer Job Toronto Ontario Canada,IT/Tech

Having recognized the advantages of remote work, including improved employee morale and productivity and the benefits of reduced commuting for employee wellbeing and the environment, we are proud to be a digital-first company. The technologies and programs in which we have invested provide a fantastic foundation to this end. Our digital-first work environment, together with our conveniently located offices and collaborative work spaces, gives our team the freedom and flexibility to work in the way that makes them most productive.

About Us

Tecsys is a fast-growing innovator offering supply chain solutions to customers ranging from industry-leading healthcare systems, hospitals, and pharmacy businesses to distributors, retailers, and 3PLs. We work with industry leaders to transform their supply chains through technology. If you thrive on tackling interesting challenges with continuous learning opportunities, then Tecsys could be a good fit for you!

About The Role

We are looking for a Site Reliability Engineer to join our Network and Security Operations Center (NOC), a team at the heart of platform reliability for mission-critical SaaS environments. You will help maintain, optimize, and ensure the reliability and performance of the systems that power our cloud infrastructure across AWS and Kubernetes, with a strong focus on automation, observability, and continuous improvement.

This role blends reliability engineering with incident command, giving you real ownership over uptime, performance, and innovation. You will be part of a highly skilled team that values creative problem-solving, operational excellence, and continuous improvement through automation and resilience engineering.

Your Responsibilities

Collaborate with other Engineering teams to support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning and launch reviews
Innovate relentlessly: identify pain points, propose creative solutions, and drive initiatives that simplify, scale, and strengthen the platform
Maintain services once they are live by measuring and monitoring availability, latency and overall system health.
Own observability: enhance and expand monitoring and alerting using Datadog; define SLOs/SLIs and create actionable dashboards that drive reliability outcomes
Drive automation: develop and improve internal tooling, IaC frameworks, and pipelines (Terraform, GitLab CI/CD) to reduce manual intervention and enable self-healing systems
Scale systems sustainably through automation and evolve systems by pushing for changes that improve reliability and velocity
Be on-call
Practice sustainable incident response and blameless postmortems. Lead post-incident reviews (RCAs) and identify long-term fixes that improve stability, reliability, and developer experience
Implement monitoring, logging, alerting, and SLA reporting
Create and maintain technical documentation
Implement, maintain and mature SRE best practices
Lead incidents: act as Incident Commander; coordinate cross-team response, manage communications, and ensure rapid service restoration
Provide support for our planning and deployment teams to enable stability, predictability, and scale in our continued growth
Collaborate with members of the Platform Engineering team to implement and support far-reaching strategic efforts, provide constructive feedback, and foster a collaborative environment
Work cross-functionally with internal teams and vendors to manage our growth around the globe, with a strong focus on maintaining the high level of performance, availability, and reliability for our users

Requirements

5+ years in Site Reliability, Cloud, or DevOps Engineering, ideally in SaaS or large-scale production environments
Experience designing and deploying large-scale systems, multi-vendor platforms, and globally distributed infrastructure
Proven experience managing cloud infrastructure in AWS (multi-account, VPC, EC2, EKS) and Kubernetes at scale
Strong hands-on experience with IaC and automation (Terraform, Ansible, or similar)
Familiarity with CI/CD pipelines and release automation (GitLab preferred, Jenkins acceptable)
Deep understanding of monitoring and observability using Datadog (or equivalent), including metric design, log pipelines, alerting, and dashboards
Experience with incident management, on-call participation, escalation, and structured postmortems
Scripting skills in Python, Bash, Java or equivalent for automation and diagnostics
Curiosity, ownership, and a bias for action; you see a problem, you solve it, and you share the lessons learned
Experience with FedRAMP compliance is a strong asset
Basic knowledge of Java- or .NET-based development required
Strong English communication skills, both written and spoken, are essential for effective correspondence with customers, business partners and colleagues beyond the province of Quebec

Additional requirements

Escalation on-call…

