Site Reliability Engineer; SRE Job, Vancouver area, BC, Canada, IT/Tech

Position: Staff Site Reliability Engineer (Staff SRE)

Walt Disney Animation Studios’ world‑class filmmakers, artists, and technical collaborators create the magic of animation. Bring your unique talents, passion and ideas to our team and prepare to play in a creative, artist‑friendly environment. We are seeking a Staff SRE with expertise in systems administration on Linux platforms, software development (Python, Go, Java, Node), CI pipeline tools (Jenkins), Git source management, cloud hosting (AWS, GCP, Azure), container computing (Docker, OCI) and web technologies.

The ideal candidate will enjoy the diversity and challenges of working at various levels in the foundational deployment stack, from configuration management to developing CI/CD infrastructure and processes.

This role resides within the Platform and Infrastructure team at Walt Disney Animation Studios (WDAS). We build the tools and manage the infrastructure that artists use daily to create our celebrated animated content. The SRE team focuses on optimizing service deployments and improving availability, latency, performance, efficiency, and observability of systems. Our projects aim for simple, performant solutions to complex problems using Agile and DevOps methodologies.

Critical to success in this role is an aptitude for working collaboratively with a technical team. You will develop and drive requirements and strategies while supporting services and core services infrastructure. Our studio thrives on a variety of technical backgrounds and experiences, so we encourage you to apply even if your experience is not specified below.

Responsibilities

As a Staff SRE, you will translate ideas into tangible products that shape experiences by focusing on automation, resiliency, efficiency, stability, security, performance, capacity management, and documentation. You will serve as a subject‑matter expert in multiple areas and be the “go‑to” individual for SRE principles and best practices. You will continuously improve the reliability of large‑scale user‑facing and internal services, with a focus on SLIs and SLOs.

You will maintain a strong understanding of stakeholder workflows and translate targeted solutions into end‑to‑end architectural designs.

Support on‑premises and cloud deployments using infrastructure‑as‑code, self‑healing, and security automation patterns.
Deploy services and manage deployments across environments.
Develop telemetry, alerts, and automated responses to reduce MTTR.
Collaborate and provide technical excellence within and across teams.
Consult on best practices and develop tools to enable smooth adoption of service reliability practices and methods.
Identify improvement areas in reliability, efficiency, and operations.
Build tools to help the SRE team quickly pinpoint, isolate and resolve infrastructure, platform and application issues.
Refine monitoring processes, configurations, and thresholds.
Promote sustainable incident response and blameless post‑mortems.
Develop runbooks and tools to streamline processes and shorten problem resolution time.
Write code that improves scalability, performance, maintainability, and security.
Maintain alert configurations and documentation as needed.
Improve CI/CD processes to increase release cadence and success.
Apply Chaos Engineering principles and methodologies to test under real‑world conditions.
Mentor SREs, sysadmins and systems engineers in technical and non‑technical SRE responsibilities.

Required Education

BS in Computer Science, Computer Engineering, Electrical Engineering or a related field.

Key Qualifications

7+ years of experience in SRE, DevOps, technical operations, systems engineering, software engineering or related discipline.
Proficient, collaborative, and experienced in building reliable, scalable enterprise systems.
Excellent communication skills, both verbal and written.
Passionate and curious about leveraging technology while continuously learning.
Skilled in container orchestration (Docker, Kubernetes, Rancher, AWS ECS/EKS) in production environments.
Experience with configuration management and infrastructure‑as‑code (Terraform, Helm, CloudFormation,…

