
AI Benchmark Engineer | Native Language Specialist - Turkish - Remote

Remote / Online - Candidates ideally in
Salem, Connecticut, 06420, USA
Listing for: LILT, Inc
Remote/Work from Home position
Listed on 2026-07-01
Job specializations:
  • Science
    AI Evaluation, Data Annotation / AI Labeling

About The Opportunity

We are building a rigorous, verifiable evaluation suite of Terminal-Bench tasks designed to test the limits of large language models on multilingual software challenges. Our goal is to measure multilingual robustness across prompt language effects, non-English data processing, and complex locale/encoding edge cases in terminal workflows.

We are seeking experienced native-speaking software engineers to design, build, and validate these benchmarks. You will create high-signal, high-quality tasks that genuinely test a model's ability to handle multilingual environments without relying on English translation crutches.

Note: This is a remote, freelance opportunity.

What You'll Deliver
  • Task Engineering:
    Design and build tasks that evaluate coding agents.

  • Asset Creation:
    Build realistic task environments using datasets and files in your native language. Crucially, these assets must remain in the target language to genuinely measure multilingual handling.

  • Prompting & Translation:
    Find the failure points where AI models break down in your native language.

  • Implementation & Verification:
    Support the development of robust solutions (reference implementations) and write highly reliable, deterministic verifier scripts (using rubric-based judging only when strictly necessary).

  • Calibration & Execution:
    Analyze execution logs and calibrate task difficulty (Easy to Very Hard) using standard Terminal-Bench run configurations against various model tiers (Haiku, Sonnet, Opus).

  • Quality Assurance:
    Participate in a rigorous, 4-layer human quality control process (creation, human review, calibration review, and audit) alongside automated LLM-based checks to ensure fairness, grammatical accuracy, and benchmark integrity.

Qualifications
  • Experience: 5+ years of industry experience in software engineering.

  • Background: Proven track record at leading technology companies and/or graduation from top-tier engineering universities.

  • Language: Native or near-native fluency in Turkish, with a deep understanding of its grammar, register, and phrasing rules. High English proficiency.

  • Technical Stack: Strong proficiency in Python, standard shell scripting, and data processing.

  • Workflow: Extensive experience with Terminal/CLI-based development workflows and a working familiarity with coding agents.

  • Domain Expertise: Deep technical understanding of multilingual text processing pitfalls, including:

    • Encoding/decoding robustness and Unicode normalization.

    • Locale-dependent conventions (collation, casing, non-Gregorian dates).

    • Text I/O, toolchain interoperability, and safe string operations.

    • (For specific languages) Bidirectional/RTL handling, font fallbacks, and rendering/typography in UI or artifacts.
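
As an illustration of the casing and normalization pitfalls listed above, here is a minimal Python sketch of the Turkish dotted/dotless "I" problem. It is not part of Lilt's tooling or of Terminal-Bench; the helper names `turkish_lower` and `turkish_upper` are hypothetical, and the approach (a manual character mapping applied before the built-in case conversion) is just one simple way to handle it.

```python
import unicodedata

# Python's str.lower() and str.upper() are locale-independent, so they
# apply the default Unicode mappings rather than the Turkish ones:
#   "I".lower() gives "i", but Turkish expects dotless "ı" (U+0131)
#   "i".upper() gives "I", but Turkish expects dotted "İ" (U+0130)


def turkish_lower(text: str) -> str:
    """Lowercase text using Turkish casing rules (hypothetical helper)."""
    # NFC first, so that "I" followed by a combining dot above (U+0307)
    # is composed into the single code point "İ" (U+0130).
    text = unicodedata.normalize("NFC", text)
    # Map the two Turkish-specific capitals by hand, then let the
    # default mapping handle every other character.
    return text.replace("I", "ı").replace("İ", "i").lower()


def turkish_upper(text: str) -> str:
    """Uppercase text using Turkish casing rules (hypothetical helper)."""
    text = unicodedata.normalize("NFC", text)
    return text.replace("i", "İ").replace("ı", "I").upper()


if __name__ == "__main__":
    print("ISPARTA".lower())          # default mapping
    print(turkish_lower("ISPARTA"))   # Turkish mapping
    print(turkish_upper("istanbul"))  # Turkish mapping
```

A task built around this kind of data would fail any solution that lowercases Turkish text with the default mapping, which is the sort of locale-dependent behavior the role is meant to probe.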

Why Collaborate with Lilt?
  • Your schedule, your rules. As an independent contractor, work when you want, as much or as little as you want. No fixed hours, no check-ins, no micromanaging.

  • Get paid quickly and fairly. We respect your time and your expertise. Competitive rates, prompt payments, no chasing invoices.

  • Work on projects that actually matter. Contribute to cutting-edge AI and language technology that is shaping how humans and machines communicate.

  • Be part of something bigger. Join a global community of linguists, subject matter experts, and language professionals who are advancing human knowledge together.

  • Grow without limits. As a Lilt contractor you get access to diverse, innovative projects that expand your portfolio and sharpen your skills across industries and domains.

  • Have fun doing what you love. Bring your language skills to life on projects that are as interesting as they are impactful.


How to Join Our Expert Community

1 - Submit your application including an updated copy of your CV in English

2 - Next, complete a GenAI assessment to evaluate your skills

3 - Finalize onboarding and profile set-up in our system, and become eligible for…
