Member of Technical Staff, ML Research Engineer

Job in San Francisco, San Francisco County, California, 94199, USA
Listing for: Arcada Labs Incorporated
Full Time position
Listed on 2026-06-18
Job specializations:
  • Research/Development
    AI Evaluation, Data Annotation/ AI Labeling
  • IT/Tech
    AI Evaluation, Data Annotation/ AI Labeling
Salary/Wage Range or Industry Benchmark: 125,000 - 150,000 USD yearly
Job Description & How to Apply Below

About

AI systems are getting better on benchmarks, but still fail in real-world use.

At Arcada Labs, we build products used by millions of people around the world that give us direct access to real human preference and judgment. That lets us evaluate models on what people actually care about, not just what benchmarks happen to measure.

Our products have reached millions of users across 190+ countries and are already used by frontier labs. We’ve collaborated on model release announcements with OpenAI, xAI, Meta, Google DeepMind, and more.

Whoever defines the evaluations defines what models become good at. We create the evolutionary pressure that pushes models toward what people actually want.

We’re a small, deeply technical team with people from Harvard, Berkeley, Apple, Microsoft, Amazon, and Meta, backed by Index Ventures, YC, Conviction, SV Angel, BoxGroup, and others.

About the Role

We’re looking for an ML Research Engineer to help us build better ways to evaluate and understand real AI capabilities.

You’ll design and run experiments that turn millions of human preferences into reliable signals about what makes models useful, trustworthy, and capable in practice (design taste, agent behavior, multi-step tasks, reasoning, etc.). Your work will shape our public leaderboards and the evaluation tools we share with frontier labs.

You’ll work at the intersection of engineering, ML, and research - deciding what to evaluate, how to evaluate it (using real human preference data and other signals), and how to turn those results into better rankings and insights.

What You’ll Own
  • Design and run large-scale evaluations that measure how frontier models perform in real-world workflows
  • Turn human preference votes and interaction traces into reliable signals about model capability, taste, reasoning, robustness, and agent behavior
  • Develop ranking systems, analysis pipelines, and experimental methods for comparing models
  • Identify where models fail, why they fail, and what those failures reveal about the next frontier of capability
  • Work with engineers to turn research findings into user-facing products, leaderboards, and tools for frontier labs
  • Contribute to internal research reports, external publications, and customer-facing analyses

What We’re Looking For
  • Experience training, fine-tuning, or evaluating models, including LLMs, reward models, preference models, or RLHF/DPO-style systems
  • Prior research experience, publications, open-source work, or hands‑on work with frontier models
  • Strong familiarity with modern AI systems, model evaluation, agentic workflows, and frontier model behavior
  • Ability to turn vague real-world problems into concrete evaluation tasks, experiments, and measurable systems
  • Strong experimental judgment, including confidence with noisy human preference data, statistical rigor, and imperfect real-world signals
  • Good taste for what matters in model behavior - and a strong desire to advance model progress