Sr. Engineering Manager, Tooling and Reliability Platforms
Listed on 2026-05-20
IT/Tech
Cloud Computing, Systems Engineer
It takes powerful technology to connect our brands and partners with an audience of hundreds of millions of people. Whether you're looking to write mobile app code, engineer the servers behind our massive ad tech stacks, or develop algorithms to help us process trillions of data points a day, what you do here will have a huge impact on our business and the world.
A Little About Us
Our Tooling and Reliability Platforms team operates as a foundational pillar of the Central Technology Organization. We provide the "paved road" for Yahoo's diverse verticals, enabling them to ship world-class products at a global scale. Our mission is to build modern, secure, and highly efficient platforms that power all of Yahoo's brands, with a relentless focus on Engineered Resilience.
A Lot About You
We are looking for a strategic Senior Engineering Manager (M4) to lead our Tooling & Reliability Platforms team. You are a Product Lead for the "paved road" of reliability at Yahoo, managing a large squad of engineers responsible for our incident management ecosystem while evolving these tools into a comprehensive, AI-augmented Reliability Platform.
You are strategic about the north star of Engineered Resilience, owning the roadmap for automated diagnostics and chaos engineering. You foster a culture of high-trust and continuous experimentation, where engineers are empowered to use modern tools to solve complex reliability challenges. You understand that in a modern engineering org, reliability is achieved through a mix of elite software engineering and intelligent automation.
Key Responsibilities
- Engineering Leadership & Productivity: Manage and grow a high-performing team. Identify and implement AI-driven efficiencies in the product lifecycle to accelerate platform delivery and engineering productivity.
- Product & Workflow Ownership: Treat the reliability stack as a product. Define the roadmap for the Incident Management platform, ensuring these tools reduce cognitive load for hundreds of service teams by replacing manual investigation steps with AI-assisted workflows.
- AIOps & Governance: Drive the integration of GenAI and SRE Agents into production environments. Establish frameworks for validating AI-generated incident summaries and hypothesis generation to ensure accuracy and prevent automated hallucinations.
- Resilience Engineering: Define the vision for the next generation of Resilience Engineering, focusing on building services that make products inherently resilient through automated alert diagnostics and self-healing systems.
- Vendor Advocacy: Act as a high-leverage partner to our key vendors, holding them accountable for roadmap delivery and ensuring their features align with our team vision.
- A Builder & A Leader: Experience managing manager-level or senior IC reports in a high-scale environment, with a track record of building internal platforms.
- Product-Minded: You don't just "install" tools; you architect a "paved road" that engineers want to use, focusing on reducing friction through intelligent automation.
- AI-Forward: You possess a commitment to combining SRE with LLMs and have the expertise to convert AI potential into effective, real-world automation and structured prompt interaction with AI tools.
- Strategic & Adaptive: Ability to manage day-to-day operations while pivoting strategy to account for emerging AI-driven reliability trends.
- Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience.
- 5+ years of experience leading SRE or DevOps teams in a high-scale, cloud-native environment.
- Strong background in Software Engineering (Python, Go, or Java) and Infrastructure-as-Code.
- Deep familiarity with incident management and AIOps tools (e.g., Rootly, PagerDuty, BigPanda).
- Experience evaluating and refining AI-generated outputs in a technical or operational context.
- Proven ability to collaborate with SaaS partners to influence a collective product vision.
- Comfort operating in an evolving AI-augmented environment with a focus on continuous learning.
- East Coast time zone preferred.
- Experience with BCP/DR planning or…