We are looking for a senior hands-on expert who can take speech systems from raw audio to reliable production features. You will build and improve core speech capabilities such as ASR, TTS, voice conversion, and speech-to-speech workflows, and you will also own the engineering work that makes them fast, scalable, and measurable in the real world.
About the Role
This role is a strong fit if you enjoy the full stack of speech AI: signal processing intuition, modern deep learning, decoding and streaming constraints, and practical deployment trade-offs.
Responsibilities
Speech modeling that ships
Build, train, and iterate on ASR models for real-world conditions such as conversational speech, accents, noise, and far-field audio, with strong offline and online evaluation discipline.
Develop and improve TTS systems that are natural, low-latency, and stable on speaker identity and prosody, with production-quality inference constraints.
Work on voice conversion and accent conversion when needed, preserving intelligibility, naturalness, and speaker identity in streaming settings.
Decoder and streaming engineering
Design and implement decoding stacks using proven libraries and patterns, including Kaldi and OpenFst, supporting features such as custom vocabulary injection, language model rescoring, and beam search tuning.
Build streaming inference systems with strict latency budgets and predictable behavior at scale, including monitoring and continuous improvement loops.
Speech analysis and speech intelligence
Deliver speech analytics building blocks such as VAD, diarization, speaker recognition, and quality analytics that improve end-to-end product outcomes.
Design robust evaluation harnesses and datasets for real user scenarios, including domain adaptation and behavior tuning across use cases.
GenAI and LLM integration for voice experiences
Integrate speech components into LLM-based systems, including cascaded ASR plus LLM plus TTS pipelines, and drive joint optimization where it materially improves product quality.
Build or extend speech generation capabilities including voice cloning, controllable prosody, and modern generative architectures where relevant to the roadmap.
Production deployment and operational excellence
Own end-to-end delivery: prototyping, ablations, training, evaluation, optimization, deployment, and post-launch monitoring.
Partner closely with product and platform teams to integrate models into real-time systems and maintain reliability, uptime, and quality under production traffic.
Qualifications
6+ years building production-grade speech or audio ML systems, or equivalent depth through research plus shipped production impact.
Strong programming ability in Python, plus comfort with C or C++ for performance-critical components.
Proven expertise in deep learning for speech (PyTorch or TensorFlow) and in practical model training and serving.
Solid fundamentals in speech and audio, including signal processing concepts and real-world acoustic variability.
Experience deploying models into real-time or high-throughput systems, including evaluation, scalability, and production reliability.
Required Skills
Hands-on experience with decoding toolchains and speech customization, including WFST concepts, beam search, and LM rescoring.
Experience with conversational or telephony speech systems, where latency, robustness, and product polish matter more than leaderboard wins.
Experience with generative speech systems such as voice cloning, flow-matching, diffusion, or autoregressive Transformer models, and with model optimization for real-time inference.
Familiarity with modern speech stacks and frameworks such as NVIDIA NeMo (or comparable) for ASR and TTS workflows.
Publications or strong open-source contributions in speech and audio AI.
Pay range and compensation package
Freshers should not apply for this role.
Equal Opportunity Statement
We are committed to diversity and inclusivity.