Sr Systems Reliability Engineer Job, Glendale area, California, USA, IT/Tech

Job Posting

Title:

Sr Systems Reliability Engineer

Job Description:

At Disney, we’re storytellers. We make the impossible, possible. The Walt Disney Company is a world‑class entertainment and technological leader. Walt’s passion was to continuously envision new ways to move audiences around the world—a passion that remains our touchstone in an enterprise that stretches from theme parks, resorts and a cruise line to sports, news, movies and a variety of other businesses.

Uniting each endeavor is a commitment to creating and delivering unforgettable experiences — and we’re constantly looking for new ways to enhance these exciting experiences.

The Enterprise Technology mission is to deliver technology solutions that align to business strategies while enabling enterprise efficiency and promoting cross‑company collaborative innovation. Our group drives competitive advantage by enhancing our consumer experiences, enabling business growth, and advancing operational excellence. As Systems Reliability Engineers (SREs) embedded in Walt Disney Imagineering, we apply software engineering principles to ensure our systems are highly reliable and efficient.

We deeply embed in engineering teams to continuously improve system performance and reliability. The Senior Systems Reliability Engineer is responsible for ensuring the stability, scalability, and performance of mission‑critical systems that support Disney’s innovative entertainment experiences. This role blends deep technical expertise with a passion for reliability, leveraging automation, monitoring, and incident management practices to enable Imagineering teams to deliver exceptional products and guest experiences.

Responsibilities of Role:

Define, measure, and monitor service‑level indicators/objectives (SLIs/SLOs) and manage error budgets for critical services.
Participate in a rotating on‑call schedule and manage incident response, including remediation and blameless post‑mortems.
Collaborate closely with engineering and product to define reliability requirements and ensure deliverables meet agreed standards.
Identify and automate manual operational processes (“toil”) to improve system reliability and engineer productivity.
Provide 24×7 on‑call operational support.

Must Haves (Years of Experience, languages, programs, tools, etc.):

Minimum of 5 years of experience with relevant internet technologies and with implementing, administering, and supporting production websites and backend support systems.
Understanding of how to install and configure operating systems, with specific expertise in Linux and Windows Server.
Knowledge of software development Continuous Integration (CI) using GitLab CI or similar.
Experience with Source Control Management systems (Git).
Infrastructure as Code via HashiCorp Terraform or OpenTofu.
Experience in AWS as well as good familiarity with Kubernetes.
Recognized as a subject‑matter expert on at least one OS and proficient in multiple operating systems, including OS performance monitoring, setup, configuration, tuning, and troubleshooting.
Understand internet technologies and network protocols, including HTTP, TLS, basic load balancing configurations, security zones, REST and DNS.
Able, with mentoring, to implement existing base standards for new systems and/or applications in all of the following areas:

Site monitoring and instrumentation
Application monitoring and instrumentation
System monitoring and instrumentation
Resiliency and performance

Able to diagnose simple to complex system problems.
Able to author tools and scripts, in standard languages like Bash, Python, Go, and PowerShell, that others can use to automate repeatable production tasks.
Advanced skills in at least one programming language, such as Python, PHP, Ruby, Java, Go, Swift, or C, and able to build unit test suites for all software being developed.
Experience supporting and/or developing backend tools or services.
Able to perform and provide in‑depth analysis of load test runs against a moderately complex system.
Demonstrates exceptional troubleshooting methodology, including the ability to author new methodologies and teach them to the SRE team.
Independently resolve moderately to highly complex system…

