ML Safety Research Engineer
Listed on 2026-01-09
Software Development
AI Engineer, Machine Learning/ML Engineer, Data Scientist, Data Science Manager
San Francisco, California, United States
Machine Learning and AI
Apple Services Engineering (ASE) powers many AI features across App Store, Music, Video and more. We build deeply personal products with the goal of representing users around the globe authentically. We work continuously to avoid perpetuating systemic biases and maintain safe and trustworthy experiences across our AI tools and models.
Description
Our team, part of Apple Services Engineering, is looking for an ML Research Engineer to lead the design and continuous development of automated safety benchmarking methodologies. In this role, you will investigate how media‑related agents behave, develop rigorous evaluation frameworks and techniques, and establish scientific standards for assessing the risks they pose and their safety performance. This role supports the development of scalable evaluation techniques that ensure our engineers have the right tools to assess candidate models and product features for responsible and safe performance.
The capabilities you build will enable the generation of benchmark datasets and evaluation methodologies for model and application outputs at scale, so that engineering teams can translate safety insights into actionable engineering and product improvements. This role blends deep technical expertise with strong analytical judgment to develop tools and capabilities for assessing and improving the behavior of advanced AI/ML models.
You will work cross‑functionally with Engineering and Project Managers, Product, and Governance teams to develop a suite of technologies that ensure AI experiences are reliable, safe, and aligned with human expectations. The successful candidate will take a proactive approach to working both independently and collaboratively on a wide range of projects. You will join a small but impactful team, collaborating with ML and data scientists, software developers, project managers, and other teams at Apple to understand requirements and translate them into scalable, reliable, and efficient evaluation frameworks.
- Design scientifically‑grounded benchmarking methodologies covering multiple dimensions of responsibility and safety across several media and application marketplace use cases
- Develop automated evaluation pipelines that collect, automatically judge, and analyze model outputs with respect to safety policies, at scale
- Create and curate datasets, tasks, and feature usage scenarios that represent realistic and adversarial use cases across multiple languages, markets, and domains
- Define and validate new metrics for complex phenomena such as multi‑turn agentic interaction patterns
- Apply statistical rigor and reproducibility practices to the objectives above
- Work closely with engineering and research teams to translate experimental findings into actionable model improvements and safety mitigations
- Publish internal reports and external papers
- Monitor evolving industry practices and academic work to ensure benchmarks remain relevant
Minimum Qualifications
- Advanced degree (MS or PhD) in Computer Science, Software Engineering, or equivalent research/work experience
- 1+ years of work experience, either as a postdoc or in industry
- Strong research background in empirical evaluation, experimental design, or benchmarking
- Strong proficiency in Python (pandas, NumPy, Jupyter, PyTorch, etc.)
- Deep familiarity with software engineering workflows and developer tools
- Experience working with or evaluating AI/ML models, preferably LLMs or program synthesis systems
- Strong analytical and communication skills, including the ability to write clear reports
Preferred Qualifications
- Technical Skills: Proficiency in Python (pandas, NumPy, Jupyter, PyTorch, etc.)
- Experience working with large datasets, annotation tools, and model evaluation pipelines
- Familiarity with evaluations specific to responsible AI and safety, hallucination detection, and/or model alignment concerns
- Ability to design taxonomies, categorization schemes, and structured labeling frameworks
- Analytical Strength: Ability to interpret unstructured data (text, transcripts, user sessions) and derive meaningful insights
- Communication: Strong ability to…