Senior Inference Platform Engineer - Data Center
Job in San Francisco, San Francisco County, California, 94199, USA
Listing for: Hamilton Barnes Associates Limited
Full Time position, listed on 2025-12-26
Job specializations:
- IT/Tech: AI Engineer, Systems Engineer, Data Engineer, Machine Learning/ML Engineer
Job Description & How to Apply Below
Join a stealth-mode hyperscale data center startup building an AI and cloud platform powered by thousands of H100s, H200s, and B200s, ready for experimentation, full-scale model training, or inference.
Our client operates high-performance GPU clusters powering some of the most advanced AI workloads worldwide. They’re now building a serverless inference platform, beginning with cost-efficient batch inference and expanding into low-latency, real-time inference and custom model hosting. This is a unique chance to join at an early stage and help define the architecture, scalability, and technical direction of that platform.
If you are interested in this opportunity, get in touch!
Key Responsibilities
- Take ownership of the inference platform architecture, from batch to low-latency workloads.
- Design, build, and optimise distributed inference systems to maximise GPU utilisation and minimise cold starts.
- Integrate, tune, and operate inference engines such as vLLM, SGLang, and TensorRT-LLM across multiple model types.
- Develop APIs, orchestration layers, and autoscaling logic to support both multi-tenant and dedicated deployments.
- Collaborate with cross-functional teams to translate business and customer needs into robust technical solutions.
- Stay up to date with the latest models, serving frameworks, and optimisation techniques, applying best practices in performance and efficiency.
- Implement monitoring, alerting, and observability workflows for production systems.
Requirements
- 5+ years’ experience building large-scale, fault-tolerant distributed systems (ML inference, HPC, or similar).
- Proficiency in Python, Go, Rust, or a comparable language.
- Strong understanding of GPU software stacks (CUDA, Triton, NCCL) and Kubernetes orchestration.
- Practical experience with model-serving frameworks such as vLLM, SGLang, TensorRT-LLM, or custom PyTorch deployments.
- Knowledge of performance optimisation techniques, including batching, speculative decoding, quantisation, and caching.
- Familiarity with Infrastructure-as-Code tools (Terraform, Helm) and low-level OS performance tuning.
Nice to Have
- Experience with event-driven or serverless architectures.
- Exposure to hybrid cloud or multi-cluster environments.
- Contributions to open-source ML or inference systems projects.
- Proven track record of cost optimisation in high-performance compute environments.
Compensation
- Equity
- $300,000 gross per year
Position Requirements
10+ years’ work experience