Machine Learning Engineer, ML/GenAI Evaluation
Listed on 2026-06-14
San Diego, California, United States
Software and Services
Would you like to contribute to Machine Learning and Generative AI technologies? Are you passionate about measuring what matters and ensuring AI systems work reliably for everyone? Do you believe that rigorous evaluation — including holding models accountable to fairness standards — is what separates great ML from good ML? We truly believe it is! We are defining what exceptional looks like for machine learning across Wallet, Payments, and Commerce.
As a Machine Learning Engineer specializing in Evaluation, you will establish the evaluation criteria, metrics frameworks, and quality standards that determine when models are ready to reach hundreds of millions of users. Your judgment shapes model quality and earns the confidence to ship. You'll work at the intersection of rigorous ML science and high-impact product decisions, collaborating closely with ML Engineering, Product, Privacy, and Legal teams.
This unique opportunity puts you at the center of model quality — designing adversarial test strategies, surfacing failure modes before they reach users, and owning the sign-off process that ensures Apple's financial features meet the highest bar for accuracy, robustness, and reliability.
The ideal candidate is a rigorous, curious ML practitioner who believes that how you measure a model is just as important as how you train it. You think critically about what metrics actually capture, know how models break in the real world, and hold quality standards others find uncomfortably high — including on dimensions like fairness. You will own the full evaluation lifecycle for ML models across Wallet features — designing test frameworks, adversarial corpora, and benchmarks that reflect the diversity of Apple's global user base, then making the final quality call before any model ships.
Your findings directly shape model development priorities and product decisions at scale.
- Define evaluation criteria and quality metrics for ML models powering Wallet features
- Design and maintain structured test sets covering the full diversity of real-world scenarios — varied document formats, distributions, languages, edge cases, and adversarial inputs.
- Develop evaluation methodologies for robustness testing: distribution shift, out-of-distribution generalization, temporal drift, and aggressor scenarios
- Own fairness evaluation end-to-end: define fairness metrics appropriate to each Wallet feature, build bias test suites across protected attributes and user populations, measure disparate performance across subgroups, and gate model launches on fairness criteria with the same rigor as conventional accuracy and robustness metrics
- Build user persona–stratified benchmarks that reflect the breadth of Wallet's global user population across spending patterns, locales, and document types
- Evaluate generative and agentic model outputs — assessing hallucination rates, faithfulness, and groundedness using LLM-as-a-judge frameworks, human evaluation protocols, and prompt regression testing
- Own model quality sign-off — establish the launch criteria, run final evaluations, and make the call on model readiness before any feature ships
- Synthesize evaluation results into clear, actionable insights that guide model development priorities and product decisions
- Partner with ML engineers and Quality engineers to identify failure modes early in the development cycle and close the loop between evaluation findings and model improvements
- Establish and evangelize evaluation best practices across the Wallet ML team, raising the quality bar for how models are tested, monitored, and maintained post-launch
- M.S. in Machine Learning, Computer Science, Statistics, Applied Mathematics, or a related technical field strongly preferred.
- Bachelor's degree with 7+ years of hands‑on experience in ML evaluation, model quality, or applied research will be considered
- 5+ years of hands‑on ML experience, with deep expertise in model evaluation, offline metrics design, and behavioral testing
- Strong track record designing evaluation frameworks for…