Machine Learning Engineer, Inference & Serving; Speech LLM - San Francisco
Job in San Francisco, San Francisco County, California, 94199, USA
Listed on 2026-06-17
Listing for: Plaud
Full Time position
Job specializations:
- Software Development: AI Engineer (Applied/Software), Machine Learning/ML Engineer, Backend Developer
Job Description & How to Apply Below
You may be a good fit if you:
- Have hands‑on experience building and deploying high‑throughput, ultra‑low‑latency inference engines for large language models or foundational speech models.
- Understand the intricate trade‑offs between latency, throughput, and Time‑To‑First‑Token (or Time‑To‑First‑Audio) in real‑time streaming environments.
- Have practical experience with continuous batching, KV cache management (e.g., PagedAttention), and the stateful connections necessary for real‑time conversational AI.
- Possess a deep understanding of GPU architectures (NVIDIA Ampere/Hopper) and the memory hierarchy, allowing you to identify and eliminate hardware bottlenecks.
- Communicate clearly and collaborate effectively, as you will sit at the critical intersection between the core ML training team and the backend infrastructure team.
- Thrive in fast‑moving environments and genuinely enjoy the systems‑engineering challenge of squeezing every last drop of performance out of a cluster of GPUs.
- Are obsessed with building AI systems that natively understand and generate speech, ultimately creating a hardware‑software AI companion that amplifies human productivity.
- Frontier Serving Frameworks: Deep, under‑the‑hood familiarity with modern LLM serving frameworks like vLLM, TensorRT‑LLM, SGLang, or NVIDIA Triton Inference Server (bonus points for active open‑source contributions to these repositories).
- Real‑Time Audio Streaming: Experience handling continuous audio streams over WebSockets or WebRTC, deploying neural audio codecs, and managing chunked audio generation to minimize conversational latency.
- Advanced Inference Techniques: Implementing cutting‑edge generation algorithms such as speculative decoding, lookahead decoding, or chunked prefill.
- Model Compression & Quantization: Hands‑on experience with post‑training quantization (PTQ) and with deploying models in FP8, INT8, AWQ, or GPTQ without degrading audio naturalness or ASR accuracy.
- Large‑Scale Distributed Systems: Deploying multi‑GPU (Tensor Parallelism) and multi‑node inference pipelines, and managing autoscaling infrastructure using Kubernetes.
- Founding Team Initiative: Opportunity to be an early, foundational member of our core SpeechLLM lab, with meaningful ownership and impact on a fast‑growing startup.
- Competitive Compensation: $200K - $540K base salary + performance bonus + equity.
- Comprehensive Benefits: Top‑tier healthcare for employees and dependents, including dental and vision, and a generous employer subsidy.
- Retirement Planning: 401(k) plan for full‑time employees with company matching.
- Paid Time Off: Unlimited PTO, plus 13 paid holidays.
- New Parent Leave: 12 weeks of paid time off to spend time with your new family, regardless of gender.
- Hybrid Office: Minimum of 3x in‑office per week to foster highly collaborative, fast‑paced research.
- Gear & Perks: Choice of top‑of‑the‑line laptops/workstations, annual offsites, and a fully stocked office.
Plaud is and will continue to be an equal opportunity employer. We do not discriminate on the basis of race, religion, color, national origin, sex, sexual orientation, age, veteran status, disability, genetic information, or other applicable legally protected characteristics.