Member of Technical Staff; Data Acquisition
Job in
Palo Alto, Santa Clara County, California, 94306, USA
Listed on 2026-05-31
Listing for:
Sanas
Full Time position
Job specializations:
- Software Development
- Data Engineering
Job Description & How to Apply Below
Member of Technical Staff (Data Acquisition)
About the Role
Your mission is to build and operate the ingestion systems that turn the open web and large-scale audio sources into reliable, well-structured corpora for training Sanas's frontier speech models. You'll own the machinery that acquires, extracts, filters, versions, and delivers audio data to our training pipelines — and you'll work directly with our research scientists to close the loop between what we collect and how it moves model quality.
Job Description
- Own and lead engineering projects across the full data acquisition stack: web crawling, audio ingestion, source discovery, and dataset delivery to training pipelines.
- Build and operate large-scale distributed crawling infrastructure capable of continuously discovering and ingesting audio at scale across languages, accents, domains, and recording environments.
- Develop specialized crawlers for high-priority audio sources with source-specific extraction and normalization logic.
- Run experiments to evaluate crawling strategies, extraction methods, and ingestion tradeoffs; analyze results to identify gaps, redundancy, and coverage improvements across speaker demographics and language pairs.
- Build ingestion pipelines that scale reliably across large data campaigns, with automated audio quality filtering — SNR estimation, clipping detection, codec artifact identification — as a first-class pipeline stage.
- Design and deploy highly scalable distributed systems capable of handling petabytes of audio data — from raw acquisition through quality filtering, deduplication, segmentation, and versioned dataset generation.
- Architect and implement indexing and search capabilities over large audio corpora — enabling fast lookup by language, speaker, acoustic condition, duration, and quality tier.
- Build and maintain backend services for data storage, including key-value databases, metadata synchronization, and manifest management across dataset versions.
- Deploy and operate acquisition infrastructure in a Kubernetes / Infrastructure-as-Code environment; perform routine system health checks and respond to production issues quickly.
- Collaborate closely with data processing, architecture, and ML platform teams to ensure smooth data flow from acquisition through to training‑ready outputs.
- Work closely with legal to handle compliance, data privacy, and licensing matters across all acquisition sources — maintaining a clear audit trail of provenance, permitted use, and commercial training rights for every dataset.
- Enforce speaker consent documentation, GDPR requirements, robots.txt and ToS adherence, and audio retention policies across all ingestion pipelines.
- Manage relationships with third‑party data vendors — writing precise acquisition briefs, evaluating quality on delivery, and ensuring sourced data meets Sanas's licensing and quality standards.
Qualifications
- 4+ years of experience in data engineering, ML data infrastructure, or backend systems engineering, with direct experience building large‑scale data ingestion or crawling systems.
- Strong Python and systems engineering skills — you build robust, maintainable infrastructure, not just one‑off scripts.
- Hands‑on experience with distributed systems design: you've built systems that handle failure gracefully, scale horizontally, and recover cleanly.
- Experience with web crawling infrastructure at scale including handling rate limiting, deduplication, and content extraction.
- Proficiency with cloud platforms (AWS or GCP), object storage (S3/GCS), and container orchestration (Kubernetes).
- Comfort working with audio processing tooling (ffmpeg, librosa, torchaudio, sox) and experience handling large volumes of audio files.
- Strong data quality instincts: you instrument pipelines, surface issues proactively, and treat data correctness with the same rigor as software correctness.
- Experience building speech or audio datasets for ASR, TTS, speech enhancement, or speaker verification model training.
- Familiarity with major open speech corpora (Common Voice, LibriSpeech, VoxPopuli, AISHELL) and their sourcing and quality characteristics.
- Experience with data versioning tools.
- Background in multilingual or low‑resource language data collection.
- Experience with annotation and labeling platforms.
- Familiarity with speaker diarization, language identification, or automated audio quality estimation models used for data filtering at scale.