ML Engineer - Automated Evaluation and Adversarial Design
Listed on 2026-06-04
Software Development
AI Engineer, Machine Learning/ ML Engineer
Seattle, Washington, United States
Software and Services
The Productivity and Machine Learning Evaluation team ensures the quality of AI-powered features across a suite of productivity and creative applications, including Creator Studio, used by hundreds of millions of people. This team serves as the primary evaluation function, providing critical quality signals that directly influence model development decisions and product launches. This role focuses on building and scaling automated evaluation systems and designing adversarial and stress-testing methodologies across multiple AI features.
The work requires a deep understanding of how AI systems fail and how to measure quality rigorously. As features evolve from single-turn interactions into multi-turn, agentic experiences, the evaluation challenge shifts from assessing individual outputs to stress-testing entire conversation flows and agent decision chains. This is an opportunity to shape the evaluation infrastructure that determines whether AI features meet the bar for hundreds of millions of users.
Day-to-day work involves designing, building, and maintaining automated evaluation systems that assess AI feature quality at scale, including multi-turn conversation evaluation and end-to-end agent workflow testing. This includes creating adversarial test suites that probe model weaknesses and running stress tests to ensure features perform under demanding conditions. A particular focus is failure modes that only emerge across extended interactions, such as context degradation, goal drift, and compounding errors.
Typical deliverables include evaluation frameworks and rubrics, quality assessment reports, adversarial test case libraries, multi-turn stress-test pipelines, and recommendations on model readiness.
- Define and own the automated evaluation approach for AI features, translating qualitative notions of quality into measurable, reproducible assessments across both single-turn and multi-turn agentic experiences
- Build adversarial test suites that target known and emerging model failure modes, covering edge cases relevant to productivity application workflows as well as conversation-level failures such as context loss, instruction forgetting, and cascading errors across multi-step tasks
- Develop and execute stress-test protocols that validate minimum performance thresholds under atypical input conditions, including extended conversation lengths, adversarial mid-conversation topic shifts, and complex tool-use sequences
- Ensure alignment between automated and human evaluation methods on an ongoing basis, identifying and resolving systematic disagreements
- Collaborate with engineering partners to integrate evaluation into development and release workflows
- Scale adversarial test case generation and stress-test execution, using automation where appropriate, including programmatic generation of multi-turn conversation scenarios and agent interaction traces
- Influence model and feature quality decisions by communicating evaluation findings and readiness assessments to cross-functional partners
- Bachelor’s degree in Computer Science, Machine Learning, Statistics, or a related field
- 4+ years of experience building or significantly extending ML evaluation systems, including designing evaluation benchmarks or quality assessment frameworks and evaluating sequential or multi-step AI outputs
- Experience independently defining evaluation architecture and methodology for AI or ML systems, with the ability to design evaluation approaches where the unit of analysis is a conversation or session rather than a single output
- Experience designing adversarial or red-teaming test methodologies for ML models or AI-powered features, including adversarial scenarios that target failures across multi-turn interactions
- Experience with Python and ML frameworks (PyTorch, TensorFlow, or equivalent) in production or near-production settings
- Track record of owning technical direction for evaluation efforts across multiple features or product areas
- Ex…