AI Evaluations Engineer
Job in Manchester, Greater Manchester, M9, England, UK
Listed on 2026-05-17
Listing for: ConnexAI
Full Time position
Job specializations:
- IT/Tech: AI Engineer (Applied/Software), Data Analyst, Machine Learning/ML Engineer, Data Scientist
Job Description & How to Apply Below
This role sits at the centre of how we measure and improve AI systems in production.
You’ll define what good performance means across LLMs, ASR, TTS, and full speech-to-speech pipelines, and build the datasets, metrics, and evaluation systems that make AI quality measurable and comparable in the real world.
You’ll work closely with engineering and product teams to ensure model changes lead to real improvements in user experience, not just better offline benchmarks.
What you’ll do
- Design and run evaluations across LLM, ASR, TTS, and speech-to-speech systems
- Build real-world datasets and test cases from production behaviour and edge cases
- Define metrics and scorecards for model and system quality
- Benchmark internal models against external and frontier systems
- Build Python tools to automate evaluation workflows
- Create internal leaderboards, red-teaming setups, and regression tests
- Work with engineers and product teams to diagnose system failures
- Turn vague product goals into measurable evaluation frameworks
- Defining and measuring AI quality in production systems
- Turning real user behaviour into structured evaluation signals
- Ensuring model changes improve real-world performance
- Understanding why AI systems fail, not just whether they do
- You can translate improved quality into measurable metrics
- You think in terms of system impact (before vs after), not just accuracy
- You’re comfortable working across code, data, and production systems
- You care about real-world behaviour, not just benchmarks
- Strong Python (scripting, data analysis, tooling)
- Experience with ML systems, evaluation, or experimentation
- Understanding of LLMs or speech systems (ASR / TTS)
- Ability to design test cases and structured datasets
- Comfortable working with engineers and product teams
- Experience with LLM evaluation or benchmarking
- Exposure to speech or multimodal systems
- Familiarity with production APIs or ML systems
- Experience with automated testing or CI-style workflows