Senior SRE/Platform Engineer
Listed on 2026-06-20
IT/Tech
SRE/Site Reliability, Cloud Computing: Infrastructure & Operations, Systems Engineer
- Toronto
- Canada
- Technology
- Full time
- 6/1/2026
- J
The company is where you can power your possible. If you want to achieve your true potential, chart new paths, develop new skills, collaborate with bright minds, and make a meaningful impact, we want to hear from you.
Synopsis of the role
Site Reliability Engineering (SRE)/Platform Engineering at the company is a discipline that combines software and systems engineering for building and running large-scale, distributed, fault-tolerant systems. SRE ensures that internal and external services meet or exceed reliability and performance expectations while adhering to the company's engineering principles.
SRE is also an engineering approach to building and running production systems – we engineer solutions to operational problems. Our SREs are responsible for overall system operation, and we use a breadth of tools and approaches to solve a broad set of problems. Our practices include limiting time spent on operational work, holding blameless postmortems, and proactively identifying and preventing potential outages.
Our SRE culture of diversity, intellectual curiosity, problem solving and openness is key to its success. The company brings together people with a wide variety of backgrounds, experiences and perspectives. We encourage them to collaborate, think big, and take risks in a blame‑free environment. We promote self‑direction to work on meaningful projects, while we also strive to build an environment that provides the support and mentorship needed to learn, grow and take pride in our work.
What you’ll do
- Kubernetes Management: Design, provision, and manage hardened, secure, cost‑optimized GKE and AWS EKS production clusters.
- Infrastructure as Code: Standardize automated, cross‑cloud infrastructure delivery utilizing Terraform.
- GitOps CD: Maintain a GitOps model via ArgoCD to match environment state directly to code repositories.
- Deployment Strategies: Execute Canary deployments (online, live‑traffic validation) and Blue‑Green deployments (offline/batch, zero‑downtime, instant rollback).
- Cloud Networking: Architect complex topologies including VPCs, Shared VPCs, Peering, Transit Gateways, and Cloud Interconnect/Direct Connect.
- Security & Connectivity: Manage cross‑cloud connectivity and enforce zero‑trust network policies within Kubernetes.
- Observability: Implement end‑to‑end distributed tracing and infrastructure monitoring using Datadog.
- Telemetry & Alerting: Build custom dashboards, monitors, and SLO/SLI alerts for deep visibility into app and infra health.
- Architectural Partnership: Translate Enterprise Architects' high‑level blueprints into automated, scalable, and secure technical implementations.
- FinOps Governance: Drive AWS/GCP/Azure cost‑saving (rightsizing, Spot/Preemptible instances, storage tiers) and automated governance (tagging, lifecycle policies, budget alerts).
- Professional Experience: Requires 7–10+ years of enterprise‑scale experience in Platform Engineering, Site Reliability Engineering (SRE), or DevOps.
- Multi‑Cloud Ecosystems: Proven mastery managing production‑grade environments across AWS and Google Cloud (GCP), plus Azure experience specifically for cost governance.
- Deep Kubernetes Expertise: 4+ years of hands‑on experience provisioning and managing EKS and GKE clusters, including production upgrades, hardening, and namespace isolation.
- Infrastructure as Code (IaC): Advanced proficiency with Terraform for multi‑cloud resource provisioning, utilizing modular, reusable code and state management.
- GitOps & CI/CD Automation: Experience building declarative workflows using ArgoCD or Flux, alongside automated pipelines that integrate security scanning, testing, and validation.
- Advanced Deployment Strategies: A proven track record of executing Canary deployments for high‑traffic online services and Blue‑Green deployments for large‑scale batch/offline workloads.
- Multi‑Cloud Networking & Zero‑Trust Security: Expertise in hybrid architectures (Transit Gateways, Shared VPCs, Direct Connect/Cloud Interconnect) combined with Kubernetes Network Policies and cloud IAM management.
- Observability & Reliability: Hands‑on experience with Datadog APM for…