Member of Technical Staff — Model Optimization and Inference; New Grad
Listed on 2026-06-14
Software Development
AI Engineer (Applied/Software), Machine Learning/ ML Engineer, Software Engineer, Backend Developer
About Nuance Labs
Nuance Labs is building photorealistic, real-time AI avatars with emotional intelligence: a full‑duplex audiovisual system that can listen, speak, react, interrupt, and respond like a real person.
About The Role
We can train a great model, but the next problem is making it fast enough to actually use in a real‑time conversation. A model that responds in 3 seconds is a demo; a model that responds in under 500 ms is a product. We’re looking for someone who’s excited about taking trained models and squeezing every last millisecond out of them. You understand—or want to deeply understand—the full stack from model weights to serving infrastructure: quantization, KV cache optimization, kernel‑level acceleration, and batching strategies.
You’ve worked with vLLM, SGLang, or similar frameworks (through coursework, research, internships, or open‑source) and have opinions about where they fall short.
This posting is aimed at early‑career engineers finishing or recently finished with a BS, MS, or PhD. We don’t require a PhD – we care about systems intuition, engineering chops, and the appetite to go deep.
What You’ll Do
- Contribute to end‑to‑end inference optimization across our model stack: LLMs, audio models, and diffusion‑based components
- Implement and tune KV cache strategies for long‑context conversations, including eviction policies, compression, and memory‑efficient attention
- Work with inference serving frameworks (vLLM, SGLang, TensorRT‑LLM, etc.) and extend them for our specific workloads
- Profile and benchmark end‑to‑end latency and throughput; identify and systematically eliminate bottlenecks
- Build internal tooling that makes optimization work faster and more rigorous—profiling viewers, end‑to‑end inference test harnesses, and other infrastructure that helps the team move quickly
- Accelerate diffusion model inference—consistency models, step distillation, caching strategies, and custom kernel optimizations
- Apply quantization techniques (INT8, INT4, GPTQ, AWQ, and beyond) to reduce memory footprint and increase throughput without meaningfully degrading quality
- Work closely with research and infrastructure to ensure new models ship with optimized serving from day one
What We’re Looking For
- BS, MS, or PhD in CS, ML, or a related field—completed or in the final stretch
- Strong fundamentals in LLM inference or ML systems—KV caching, memory layout, attention kernels, batching, or serving—picked up through coursework, research, internships, or open‑source. You don’t need to have shipped at production scale yet; you do need to learn fast and go deep.
- Exposure to inference serving frameworks (vLLM, SGLang, TensorRT‑LLM, or similar), even at a research or hobby level
- Strong Python and PyTorch skills; familiarity with CUDA or Triton is a significant plus
- A systematic approach to profiling and optimization: you measure first, then optimize
- Curiosity about diffusion inference, speculative decoding, quantization, or other inference‑time acceleration techniques
Nice to Have
- Internship or research experience with LLM inference, ML systems, or model serving
- Contributions to open‑source inference frameworks (vLLM, SGLang, TensorRT‑LLM, etc.)
- CUDA / Triton kernel work, even at a research or hobby scale
- Publications or research projects in MLSys, model compression, or inference optimization
- Familiarity with multimodal or streaming inference architectures
- Experience with hard latency SLAs in any real‑time system
$200,000 – $300,000 base salary, plus meaningful equity. We think long‑term ownership matters and structure equity accordingly.
Logistics
- Location: In‑person in Seattle, five days a week. We believe in the compounding value of working shoulder‑to‑shoulder.
- Visa sponsorship: We sponsor visas (O‑1, H‑1B, green card) from day one.
- AI‑native tooling: Do your best work with the best tools, including unlimited tokens.
- Health: HSA plan with ~$2,000 in annual company contributions — roughly 2× what most big tech companies put in.
- Time off: 15 days of PTO plus public holidays, and we close the office for a full week at year‑end.
- Food: Lunch, drinks, and snacks on us every workday. It is the small thing that quietly makes the day better.
- Commuter benefits: We help cover the cost of getting to the office.
- 401(k): In the works.
Nuance Labs is an equal opportunity employer. We believe diverse teams build better AI.