Language engineer
Listed on 2026-01-09
IT/Tech
Data Scientist, Data Analyst
We are a dynamic and innovative small SaaS company specializing in language data products and services. We are a team of 17, distributed across two offices in Amsterdam and Thessaloniki.
About the Project
TAUS is executing technical work streams for the European Commission’s BEACON project, focused on collecting, curating, and publishing high-quality parallel text corpora for machine translation in EU candidate country languages. This 9-month project involves processing hundreds of millions of sentences from diverse sources, applying rigorous quality assurance frameworks, and preparing publication-ready datasets for seven language pairs:
English paired with Ukrainian, Serbian, Bosnian, Macedonian, Albanian, Montenegrin, and Romanian/Moldovan, with particular focus on legal and administrative domains.
We seek a skilled and motivated Language Data Engineer to join our technical team for large-scale parallel corpus collection, processing, and quality assurance. You will work hands‑on with real‑world challenges in low‑resource language processing and quality assurance at scale, and contribute directly to expanding Europe’s multilingual digital infrastructure.
Responsibilities
- Download and catalog parallel corpora from public repositories and implement targeted web crawling for legal/administrative domain content.
- Extract text from diverse formats (PDFs, HTML, document archives) and apply bilingual as well as monolingual corpus mining techniques.
- Document source provenance, licensing, and metadata comprehensively.
- Execute preprocessing pipelines: format normalization, sentence segmentation, alignment, language identification, and quality filtering.
- Handle large‑scale data processing with deduplication and anonymization.
- Maintain detailed processing logs and quality metrics throughout the pipeline.
- Validate NLP tool performance across seven language pairs and implement automated quality checks (alignment confidence, language identification, domain classification).
- Coordinate with linguists for human validation and generate quality reports with statistical metrics.
- Troubleshoot and resolve quality issues in processing workflows.
- Contribute to technical deliverables and project documentation meeting EC standards.
- Collaborate with European Commission experts and cross‑functional teams on methodology and quality criteria.
- Ensure compliance with EU data governance, GDPR, and licensing requirements.
TAUS
Qualifications
- 3+ years of work experience with Natural Language Processing (NLP)
- 3+ years of work experience with Python (Programming Language)
- Authorized to work in Yes
Mid Career (2+ years of experience)
Tagged as:
- Classification
- Industry
- Machine Translation
- Natural Language Processing
- Netherlands
- NLP