Machine Learning Safety: Evaluation Research Engineer Job, Seattle area, Washington, USA, Software Development

Machine Learning Safety: Evaluation Research Engineer

Seattle, Washington, United States
Machine Learning and AI

This role supports the design and development of safety evaluation methodologies for generative and agentic AI features that enable users across the globe to interact with our media products and services.

Description

You will play an impactful role: shaping responsible AI and safety policies, evaluating fidelity to product safety requirements, creating risk assessments and taxonomies, curating exemplar safety evaluation datasets, and ensuring that evaluation frameworks are culturally and linguistically grounded. An ideal candidate has a strong understanding of issues in responsible AI and in AI and society, knows technology evaluation design principles and practices, and brings experience designing evaluations to support policies and/or product requirements, classification systems, and annotation and/or study participant guidelines.

Responsibilities

Taxonomy Development:
Design, refine, and maintain safety-relevant taxonomies that capture risk categories, content types, and policy distinctions, in collaboration with subject matter experts who bring knowledge across languages and cultural contexts. You will work with these experts to ensure taxonomies are comprehensive, internally consistent, and actionable for downstream evaluation work.
Policy-to-Data Translation:
Develop and validate exemplar sets that illustrate taxonomy categories, edge cases, and boundary conditions. Collaborate with language and cultural experts to ensure exemplars are culturally appropriate and representative across target markets. Partner with policy, product, and engineering teams to translate responsible AI policies and guidelines into concrete data requirements, annotation schemas, and evaluation criteria that can be operationalized across markets.

Synthetic Data Generation:
Develop and maintain synthetic data generation pipelines to augment evaluation coverage, stress-test safety boundaries, and support evaluation in low-resource languages. Ensure synthetic data is diverse, representative, and validated against human-generated benchmarks.
Automated Judge Development:
Shape the development, training and fine‑tuning, and validation of automated judge models that can reliably score AI system outputs for safety and policy compliance. Develop calibration and agreement metrics to ensure judges meet human‑parity benchmarks. Design and implement validation frameworks to assess the accuracy, reliability, and consistency of automated evaluation systems. Develop methods to detect drift, bias, and failure modes in automated judges across markets.
Scalable Analysis & Reporting Automation:
Create automated pipelines for analysis and reporting that reduce manual effort, increase reproducibility, and enable rapid cross‑market safety assessments. Build tooling that integrates with existing dashboards and reporting workflows.
Documentation & Communication:
Produce clear, detailed documentation artifacts. Present findings and recommendations to cross‑functional stakeholders including engineering, product, compliance, and policy teams.
Canonical Guideline Development:
Author and maintain canonical evaluation guidelines that standardize task definitions, rating criteria, and edge‑case handling. These assets will be adapted to scale across languages and markets, with the support of multi‑lingual and operations experts. You will ensure guidelines are clear, complete, and adaptable.
Evaluation Design & Execution:
Design and run pilot evaluations to validate task setups, identify guideline ambiguities, calibrate annotator understanding, and surface issues before full‑scale deployment. Manage evaluation instruments throughout. Analyze pilot results and iterate on guidelines and configurations accordingly.
Monitoring & Data Quality:
Develop and implement monitoring frameworks to track evaluation progress, annotator performance, inter‑rater agreement, and data quality in real time. Flag…