
Research Internship: Universal Phonetizer for Next-Generation Voice AI

Job in Zürich, 8058, Zurich, Kanton Zürich, Switzerland
Listing for: Agigo AG
Apprenticeship/Internship position
Listed on 2026-06-11
Job specializations:
  • IT/Tech
    AI Engineer (Applied/Software), Data Scientist, Machine Learning/ML Engineer
Salary/Wage Range or Industry Benchmark: 30000 - 80000 CHF Yearly
Job Description & How to Apply Below
Position: Research Internship: Universal Phonetizer for Next-Generation Voice AI
Location: Zürich

Research Internship:
Universal Phonetizer for Next-Generation Voice AI

Full-time | Voice & Conversational AI | Global Enterprise AI Platform

Duration: 4-8 Months

About AGIGO

AGIGO™ is the first enterprise-grade conversational AI platform that empowers enterprises to transform customer engagement and business performance with high-agency AI agents - agents that match well-trained human customer agents in naturalness, responsiveness, and autonomous task resolution. Built for on-premises or hybrid deployment, with no reliance on third-party services, our proprietary platform gives enterprises full control, observability, and data sovereignty. Its unified core, tunable base models, and end-to-end design toolchain deliver context-aware, adaptable agents that engage directly with customers in real time.

Founded in February 2025 in Switzerland by a team of 18 experienced AI pioneers, AGIGO is driven by a bold vision to lead the next major wave in AI by transforming how businesses interact with their customers.

Your Research Mission

In this internship, your mission is to architect and train a dynamic, universal neural phonetizer, which - based on AGIGO’s ground-breaking proprietary innovation - is capable of inferring the correct pronunciation of any word, including new or foreign ones. When brought to a production-ready state, the model will replace static G2P dictionaries, thereby finally solving a long-standing problem that has hampered the user experience of voice-based conversational AI systems for decades.

The phonetizer should handle accent variations with high accuracy and low latency and be designed for seamless integration with LLM-based voice synthesis systems. At the forefront of Voice-AI innovation, your project will lay the foundation for commercial application of recent AGIGO inventions, further strengthen AGIGO’s leadership in voice synthesis, and ultimately contribute to an enhanced multilingual user experience of voice-enabled applications.

Phase 1:
Data Foundation - Large-Scale Aligned Text Corpus Creation

Your first task will be to engineer a robust data-processing pipeline that creates a massive, high-quality training corpus of isolated words with their phonemes, i.e., triplets of (word, phoneme sequence, audio). This involves:

Forced Alignment at Scale: You will utilize and refine forced alignment tools to process over 100,000 hours of multilingual speech. The goal is to obtain precise timestamps for every phoneme in the dataset, creating a vast corpus of time-aligned audio and phoneme-sequence pairs.

Data Curation and Normalization: You will develop strategies to filter noisy alignments and normalize text while handling variations in pronunciation across our diverse datasets. This foundational work is critical for training a state-of-the-art model.
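As a rough illustration of the curation step, the following Python sketch filters word-level alignments and turns the surviving ones into (word, phoneme sequence, audio span) triplets. It is not AGIGO's pipeline: the names AlignedPhoneme, WordEntry, is_clean and to_triplet, the per-phoneme confidence score, and all three thresholds are hypothetical and chosen only for this example.

```python
import unicodedata
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class AlignedPhoneme:
    phoneme: str   # IPA symbol
    start: float   # start time in seconds
    end: float     # end time in seconds
    score: float   # aligner confidence in [0, 1] (assumed to be available)


@dataclass
class WordEntry:
    word: str
    phonemes: List[AlignedPhoneme]


# Hypothetical thresholds; real values would be tuned per language and aligner.
MIN_PHONE_SEC = 0.02
MAX_PHONE_SEC = 0.60
MIN_MEAN_SCORE = 0.5


def is_clean(entry: WordEntry) -> bool:
    """Return True if the word's alignment looks plausible."""
    if not entry.phonemes:
        return False
    # Reject implausibly short or long phoneme durations.
    for p in entry.phonemes:
        duration = p.end - p.start
        if duration < MIN_PHONE_SEC or duration > MAX_PHONE_SEC:
            return False
    # Phonemes must appear in temporal order without overlapping.
    for prev, nxt in zip(entry.phonemes, entry.phonemes[1:]):
        if nxt.start < prev.end - 1e-6:
            return False
    mean_score = sum(p.score for p in entry.phonemes) / len(entry.phonemes)
    return mean_score >= MIN_MEAN_SCORE


def to_triplet(entry: WordEntry) -> Tuple[str, List[str], Tuple[float, float]]:
    """Build (normalized word, phoneme sequence, (start, end) audio span)."""
    word = unicodedata.normalize("NFC", entry.word).casefold()
    span = (entry.phonemes[0].start, entry.phonemes[-1].end)
    return word, [p.phoneme for p in entry.phonemes], span
```

In this sketch the audio is represented only by its time span; a real pipeline would presumably also cut or reference the corresponding waveform segment.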

While we are open to exploring large, LLM-based sequence-to-sequence models, our primary focus is on production viability. Therefore, we prioritize non-autoregressive (NAR) architectures for their superior inference speed and low compute requirements, including the possibility of running on CPU-only machines.

The Model: The proposed architecture consists of a powerful, pre-trained speech encoder (e.g., a wav2vec2-style or HuBERT encoder) followed by a linear projection layer that maps to phonemes from the International Phonetic Alphabet (IPA).

The Training Objective: The model will be trained end-to-end for phoneme recognition using the Connectionist Temporal Classification (CTC) loss function. This NAR approach predicts a sequence of phonemes for a given audio input in a single forward pass, making it extremely efficient. We are also open to exploring recent or alternative loss functions and methodologies.
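To make the single-pass idea concrete, here is a minimal Python sketch of greedy CTC decoding: each audio frame independently picks its best-scoring symbol, repeated symbols are merged, and blanks are dropped. The vocabulary, the "<blank>" token name, and the frame scores are made up for illustration; in the proposed system the scores would come from the linear projection on top of the speech encoder.

```python
from typing import List, Sequence

BLANK = "<blank>"  # assumed name for the CTC blank symbol


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def ctc_greedy_decode(
    frame_scores: Sequence[Sequence[float]],
    vocab: Sequence[str],
    blank: str = BLANK,
) -> List[str]:
    """Collapse per-frame best symbols into a phoneme sequence.

    frame_scores holds one list of scores per audio frame, with one
    score per entry of vocab.
    """
    decoded: List[str] = []
    previous = None
    for scores in frame_scores:
        symbol = vocab[argmax(scores)]
        # Emit a symbol only when it differs from the previous frame's
        # symbol and is not the blank; a blank between two identical
        # symbols therefore keeps both of them.
        if symbol != previous and symbol != blank:
            decoded.append(symbol)
        previous = symbol
    return decoded


if __name__ == "__main__":
    vocab = [BLANK, "k", "æ", "t"]
    # Frame-wise best path: k k <blank> æ æ <blank> t
    frames = [
        [0.1, 0.7, 0.1, 0.1],
        [0.2, 0.6, 0.1, 0.1],
        [0.8, 0.1, 0.05, 0.05],
        [0.1, 0.1, 0.7, 0.1],
        [0.1, 0.1, 0.6, 0.2],
        [0.9, 0.0, 0.05, 0.05],
        [0.1, 0.1, 0.1, 0.7],
    ]
    print(ctc_greedy_decode(frames, vocab))  # ['k', 'æ', 't']
```

Because no frame's decision depends on a previously emitted symbol, all frames can be scored in parallel, which is the property that makes the approach non-autoregressive.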

Evaluation: Intrinsically, the system will be evaluated by Phoneme Error Rate (PER) on held-out test sets. Extrinsic evaluation will involve integrating your model into our end-to-end TTS pipeline and measuring its impact on synthesized speech quality and intelligibility (e.g., via aggregated ASR-based WER), alongside other well-known objective metrics for TTS systems.
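For reference, PER is commonly computed as the total edit distance (substitutions, deletions, insertions) between predicted and reference phoneme sequences, divided by the total number of reference phonemes. The Python sketch below follows that common definition; the function names and the example sequences are illustrative and not taken from AGIGO's evaluation code.

```python
from typing import List, Sequence, Tuple


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference phoneme
                curr[j - 1] + 1,     # insertion of a hypothesis phoneme
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def phoneme_error_rate(
    pairs: List[Tuple[Sequence[str], Sequence[str]]],
) -> float:
    """Corpus-level PER over (reference, hypothesis) pairs."""
    total_ref = sum(len(ref) for ref, _ in pairs)
    if total_ref == 0:
        raise ValueError("reference phoneme count must be non-zero")
    total_errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    return total_errors / total_ref


if __name__ == "__main__":
    pairs = [
        # One substitution (s -> ʃ) and one deletion (k): 2 errors, 6 reference phonemes.
        (["s", "k", "ɛ", "dʒ", "uː", "l"], ["ʃ", "ɛ", "dʒ", "uː", "l"]),
        # Exact match: 0 errors, 3 reference phonemes.
        (["k", "æ", "t"], ["k", "æ", "t"]),
    ]
    print(phoneme_error_rate(pairs))  # 2 / 9
```

Aggregating errors over the whole test set before dividing, rather than averaging per-word rates, keeps short words from dominating the score.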

This is where we move beyond the baseline. This project is a chance for deep applied research with a direct impact. We are eager to explore and innovate with you on topics like:

Disentangled Representation: Can we train the model to separate phonetic content from speaker identity or prosodic information, leading to a more robust phoneme recognizer?

Extending the Phonetic Dictionaries: Can we extend the phonetic dictionary from single phonemes to bi-phone or tri-phone systems?

Zero-Shot Cross-Lingual Phonetization: By training on a diverse set of languages, can the model generalize to pronounce words from unseen languages by learning a universal phonetic space?

Accent and Dialect Modeling: You can investigate conditioning the model on language/accent tags to produce tailored pronunciations, e.g., generating a US-English vs. UK-English phoneme sequence for "schedule".

Your Impact

The final, trained model and system will be integrated directly as a core component in our production…

Position Requirements
Less than 1 Year work experience