We are looking for someone who thinks in tokens per second, VRAM budgets, and concurrency. If the difference between a 4-bit and an 8-bit deployment, or between loading five models and serving five LoRA adapters over one base, is second nature to you, this role is built for you.
What you will do
● Deploy and serve language and speech models in production using vLLM, TGI, TensorRT-LLM, or equivalent serving stacks.
● Choose and apply quantization strategies (INT8, 4-bit, FP8 where supported) to fit models onto available hardware without sacrificing quality.
● Design multi-model and multi-LoRA-adapter deployments: share a base model across many task-specialists, tune batching, and manage KV cache to maximize concurrency on a single card.
● Size deployments to real workloads: produce throughput-versus-latency curves, model peak-load and burst behavior, and decide when to scale by adding hardware.
● Stand up and operate GPU servers on both rented cloud (neocloud and hyperscaler) and owned or colocated physical hardware.
● Build the layer around the models: request routing and load balancing across cards, health monitoring, observability, and graceful handling of latency spikes.
● Make and defend hardware decisions: consumer versus data-center cards, training versus inference silicon, rent versus buy, matched to cost and compliance requirements.
● Own the deployment side of the model lifecycle: take finished models and adapters from the training team and get them running efficiently and reliably in production.
What we are looking for
● Demonstrated experience serving open models in production, not only calling hosted APIs.
● Fluency with at least one modern inference server (vLLM, TGI, TensorRT-LLM) and hands-on quantization experience.
● Solid understanding of GPU memory: what consumes VRAM, how KV cache scales, and how to fit models onto constrained hardware.
● Comfort provisioning and operating physical GPU servers, not only cloud abstractions. You are not afraid of drivers, CUDA, and a headless Linux box.
● Practical grasp of the difference between training and inference workloads, and why they favor different hardware.
● Systems thinking: you can reason about routing, batching, concurrency, and failure modes across a whole serving stack, not just a single model.
Nice to have
● Experience with speech models (Whisper or similar) alongside language models.
● Familiarity with LoRA and adapter-based fine-tuning, enough to collaborate closely with the training side.
● Exposure to regulated or compliance-sensitive deployments (ECC, data residency, SLA requirements).
● Cost-optimization instincts: committed versus on-demand pricing, spot capacity, right-sizing hardware to workload.