AI Evaluations Engineer

Job in Manchester, Greater Manchester, M9, England, UK

Listing for: ConnexAI

Full Time position
Listed on 2026-05-17

Job specializations:

IT/Tech
AI Engineer (Applied/Software), Data Analyst, Machine Learning/ML Engineer, Data Scientist

Salary/Wage Range or Industry Benchmark: GBP 80,000 - 100,000 per year

This role sits at the centre of how we measure and improve AI systems in production.

You’ll define what good performance means across LLMs, ASR, TTS, and full speech-to-speech pipelines, and build the datasets, metrics, and evaluation systems that make AI quality measurable and comparable in the real world.

You’ll work closely with engineering and product teams to ensure model changes lead to real improvements in user experience, not just better offline benchmarks.

What you’ll do

Design and run evaluations across LLM, ASR, TTS, and speech-to-speech systems
Build real-world datasets and test cases from production behaviour and edge cases
Define metrics and scorecards for model and system quality
Benchmark internal models against external and frontier systems
Build Python tools to automate evaluation workflows
Create internal leaderboards, red-teaming setups, and regression tests
Work with engineers and product teams to diagnose system failures
Turn vague product goals into measurable evaluation frameworks

What this role is about

Defining and measuring AI quality in production systems
Turning real user behaviour into structured evaluation signals
Ensuring model changes improve real-world performance
Understanding why AI systems fail, not just whether they do

What good looks like

You can translate “improved quality” into concrete, measurable metrics
You think in terms of system impact (before vs after), not just accuracy
You’re comfortable working across code, data, and production systems
You care about real-world behaviour, not just benchmarks

Core skills

Strong Python (scripting, data analysis, tooling)
Experience with ML systems, evaluation, or experimentation
Understanding of LLMs or speech systems (ASR / TTS)
Ability to design test cases and structured datasets
Comfortable working with engineers and product teams

Nice to have

Experience with LLM evaluation or benchmarking
Exposure to speech or multimodal systems
Familiarity with production APIs or ML systems
Experience with automated testing or CI-style workflows
