Evaluation Engineer

Job in Oakland, Alameda County, California, 94616, USA
Listing for: Elicit
Full Time position
Listed on 2026-02-16
Job specializations:
  • Software Development
    Software Engineer
Salary/Wage Range or Industry Benchmark: 80000 - 100000 USD Yearly
Job Description & How to Apply Below

About Elicit

Elicit is an AI research platform that uses language models to help researchers figure out what’s true and make better decisions, starting with common research tasks like literature review.

What we’re aiming for:

  • Elicit radically increases the amount of good reasoning in the world.

    • For experts, Elicit pushes the frontier forward.

    • For non-experts, Elicit makes good reasoning more affordable. People who don’t have the tools, expertise, time, or mental energy to make well-reasoned decisions on their own can do so with Elicit.

  • Elicit is a scalable ML system based on human-understandable task decompositions, with supervision of process, not outcomes. This expands our collective understanding of safe AGI architectures.

  • Visit our Twitter to learn more about how Elicit is helping researchers and making progress on our mission.

    The mission of Elicit evals

    Some orgs build evals to warn us about dangerous capabilities. Others build evals to understand trends and predict future developments. Yet others build evals to hill-climb towards models that users will like more.

    At Elicit, we’re focused on something different: we want to understand, and hill-climb towards, models that help us make better decisions.

    This is tougher than "what will users like better"—it’s hard to evaluate decision support, and users’ knee-jerk reactions may not align with what actually helps for decision-making. Because it’s hard, and because the sales pitch is more complicated, there aren’t many doing this well. If we nail this, we have a unique opportunity to push AI toward helping us make better decisions, both within Elicit and beyond.

    Why we’re hiring for this role

    We need someone to own the technical foundation of our auto-evaluation systems. Our evals are currently much slower than they need to be, and our interfaces aren’t optimized for the diverse set of people who need to use them—ML engineers iterating on models, product managers monitoring quality, and customers assessing trust in results.

    The right person for this role won’t just build infrastructure. You’ll think deeply about what it actually means for Elicit to help with decision-making in pharma and encode that understanding into our evaluation systems.

    What you’ll own

    The core auto-eval platform

    You’ll build a comprehensive system that runs fast, is easy to use, and supports quickly building new evals:

    • Speed: You’ll build lightning-fast basic evals infrastructure whose task scheduling adds practically no latency, and then you’ll figure out clever ways to reduce the fundamental sources of latency (building a version of Elicit, running it on a query, and evaluating it using LMs).

    • Interfaces: ML engineers need evals to kick off automatically on relevant commits, with results they can see at a glance and drill into. Product managers need dashboards showing performance over time and what’s going wrong in production.

    • Architecture: Your code must be well-architected so other team members and ML engineers can understand and build on it. An engineer starting on a new feature should be able to quickly add examples and run an eval.

    Ensuring evaluations are accurate and reliable

    • We need to evaluate how well Elicit actually helps with decision-making in pharma, not just measure what’s easy to measure. This requires encoding real knowledge about how pharma customers make decisions (for example, choosing appropriate gold standards).

    • You’ll provide appropriate statistical tests and confidence intervals so we can trust our results.

    A month in your life

    In a typical month, expect to spend:

    • 60% working on the core eval platform

    • 15% working closely with the evals team to build and improve specific evals (e.g., an eval of our paper search within our systematic review flow)

    • 10% mentoring our evals engineering intern

    • The rest on learning how people interact with the eval system so you can make it work better for them, and understanding what our users want from Elicit so evals measure what matters

    What you bring to the role

    Requirements

    • At least 3 years of experience as a professional software engineer, with demonstrated experience building complex backend systems (e.g., backend for a complex…
