About the Role:
Site Reliability Engineering (SRE) at Tubi is not a traditional operations team. We are a software engineering organization that applies a developer's mindset and toolkit to the challenges of building and running large-scale, distributed systems. Our mission is to engineer resilience from the ground up, enabling our product teams to innovate rapidly while ensuring our users have a stellar experience. We own the availability, latency, performance, and capacity of our platform, and we achieve our goals through a culture of data-driven decision-making, blameless learning, and relentless automation.
As a Senior Site Reliability Engineer, you are a hands-on engineer who blends deep software development expertise with a passion for operational excellence. You will be responsible for designing, building, and running the resilient, scalable, and increasingly self-healing systems that power our products. You will apply sound engineering principles to solve our most complex reliability challenges, with a mandate to automate everything, eliminate toil, and write robust, maintainable code.
You will be a force multiplier, mentoring other engineers and elevating the site reliability bar for the entire organization.
What You'll Do:
System Architecture & Design: Design, build, and maintain scalable, highly available, and fault-tolerant distributed systems. Partner with development teams as a reliability consultant, reviewing designs and influencing architectural decisions to ensure new services are built with reliability, observability, and performance as core principles, not afterthoughts.
Automation & Software Development: Write robust, performant, and maintainable code to automate operational tasks and CI/CD pipelines. Build the internal tools, libraries, and frameworks that enable engineering teams to self-service their observability needs, reducing cognitive load and increasing their velocity.
Incident Response & Post-Mortem Analysis: Participate in a 24/7 on-call rotation, acting as a key technical leader and incident commander during critical service disruptions. Conduct deep, blameless root cause analyses (RCAs) that go beyond immediate fixes to identify and address systemic issues. Drive the implementation of corrective actions to prevent the recurrence of incidents.
Performance & Capacity Planning: Proactively monitor, measure, and optimize system performance to ensure low latency and high efficiency. Gather and analyze metrics from operating systems and applications to assist in performance tuning and fault finding. Analyze usage patterns and historical data to forecast capacity needs, ensuring our platform stays ahead of customer demand.
Your Background:
Bachelor's degree in Computer Science, a related technical field, or equivalent practical experience.
5+ years of professional experience in a Site Reliability Engineering, DevOps, or Software Engineering role with a focus on infrastructure and operations.
Strong programming proficiency in one or more high-level languages such as Rust, Go, Python, or TypeScript. You should be comfortable writing, testing, and deploying production-grade code.
Deep knowledge of AWS services (especially networking, IAM, EKS, ALBs/NLBs, Route 53, CloudWatch).
Proven experience with Kubernetes in production (EKS preferred), including service exposure, networking, and availability engineering.
A solid understanding of Linux/Unix operating systems, networking fundamentals (TCP/IP, DNS, HTTP), and the architecture of modern distributed systems.
Preferred Qualifications (Nice-to-Haves):
Experience building and managing large-scale monitoring and observability systems using tools such as Datadog, Prometheus, and Grafana.
Expertise in designing and…
Position Requirements:
10+ years of work experience