Site Reliability Engineer (hybrid or remote) Job, Toronto, Ontario, Canada, IT/Tech

Position: Staff Site Reliability Engineer (hybrid or remote)
Our Site Reliability Engineering team sits at the intersection of software engineering and operations, building reliable, scalable cloud systems that our teams and customers can trust.
As Staff Site Reliability Engineer, you'll play a critical role in the management and advancement of our global infrastructure. You'll draw on approximately 15 years of technical experience, focused on the evolution of high-concurrency distributed systems and the orchestration of hyper-scale cloud environments, to architect our GCP/GKE environment and lead the integration of AI-driven workflows.

This includes utilizing bots, automated PR remediation, and intelligent alerting to ensure our platform can scale efficiently and reliably.
Why you'll love this role:
Lead high-impact initiatives that shape how millions of people experience work around the world.
Bring your unique perspective to complex and challenging projects - apply your expertise in architecture, influence technical direction, and mentor fellow team members.
Join a close-knit, no-ego, high-performing team that solves meaningful problems and celebrates successes together.
Work alongside an experienced leadership team that is genuinely invested in your career growth.
Thrive in a fast-paced, high-growth environment where innovation is encouraged and your voice truly matters.
How you’ll shape our cloud infrastructure:
Architectural Leadership: Lead the design and ongoing evolution of our global, high-availability infrastructure, focusing on Google Cloud Platform (GCP) and Kubernetes (GKE).
AI & Automation Strategy: Identify repetitive operational tasks and implement AI-integrated workflows, such as Slack or Teams bots for incident triage, AI-augmented alerting, and automated PR generation to address infrastructure drift.
Cross-Functional Influence: Collaborate with Product, Engineering, and Leadership teams to identify systemic risks, manage complex changes, and define the long-term reliability roadmap.
Infrastructure-as-Code (IaC): Establish and exemplify best practices for Terraform and CI/CD pipelines, empowering development teams to deploy code rapidly and securely.
System Resiliency: Lead high-level initiatives in disaster recovery, multi-region networking, and the design of zero-trust security architectures.
Technical Mentorship: Guide design reviews and promote best practices, enhancing the technical skills and capabilities of the entire SRE organization.
Experience we feel will set you up for success:
The 15-Year Lens: Possess extensive systems engineering experience, with in-depth knowledge of Linux kernels, network protocols (TCP/IP, BGP, DNS), and cloud-native architecture.
GCP Expertise: Demonstrated, hands-on experience in architecting and managing production workloads on Google Cloud Platform and GKE.
AI/Workflow Automation: Practical experience or a strong vision for integrating AI tools and LLMs to automate SRE tasks, documentation, or incident response.
Code Proficiency: Advanced skills in Python or Go, with the ability to develop sophisticated internal tools and automation frameworks.
Observability Mastery: Expert understanding of observability frameworks (such as New Relic, Prometheus, Grafana) to enable data-driven decision-making.
Database Foundations: Deep knowledge of managing relational databases (MySQL, MongoDB) …
Communication: Exceptional ability to clearly convey complex technical infrastructure challenges as actionable business insights to non-technical stakeholders.
The Achievers Mindset
Disruptive Innovator: Set industry trends by applying emerging technologies like AI to address longstanding infrastructure challenges.
Self-Starter: Maintain a mindset of continuous improvement, always seeking opportunities to automate processes.
Culture of Success: Believe that platform reliability is fundamental to both employee success and customer trust.
Bonus Points
Hands-on experience with Service Mesh (Istio) and advanced GCP Networking features, such as Interconnect and Shared VPC.
A proven history of migrating legacy automation systems to modern,…

