On-Prem Platform Engineer
Listed on 2026-07-02
Software Development
AI Engineer (Applied/Software), Machine Learning/ML Engineer, AI Reliability/Performance Engineer
Role: On-prem Platform Engineer
Location: Charlotte, NC
Key Skills
Must-Have Skills (Mandatory Keywords)
- LLM Inference & Optimization
- vLLM, TensorRT-LLM, Triton Inference Server, SGLang
- Inference optimization techniques
- Continuous batching
- Speculative decoding
- KV cache / Prefix caching
- Model optimization
- FP8, AWQ, GPTQ
- Tensor parallelism and large model scaling
- CUDA, NCCL, GPU architecture
- GPU partitioning & optimization (MIG)
- Kubernetes-based ML serving platforms
- KServe, OpenShift AI
- Helm charts, Operators, platform automation
- Run:AI or similar GPU scheduling/orchestration platforms
- Multi-tenant GPU workload management
- Experience building internal AI/ML platforms (on-prem or hybrid)
- Strong automation and system design mindset
- Prometheus, Grafana
- ML observability (model latency, throughput, drift, resource utilization)
- Performance benchmarking and tuning
- Experience with LLMOps / GenAI pipelines
- Exposure to hybrid cloud (on-prem + GCP/Azure integration)
- Familiarity with Inferentia / alternative accelerators
- Knowledge of service mesh / networking in GPU clusters
Build, configure, and operate on-prem Kubernetes/OpenShift AI platforms for deploying and serving GenAI models and LLM inference workloads.
Design and optimize high-performance inference stacks using vLLM, TensorRT-LLM, Triton Inference Server, SGLang, and advanced techniques (continuous batching, speculative decoding, KV caching).
Manage GPU orchestration and capacity using Run:AI, MIG, CUDA/NCCL, and tensor parallelism to maximize utilization and throughput.
Deploy and operate Kubernetes ML serving frameworks (KServe, Helm, Operators) for scalable, reliable model serving.
Drive inference optimization and benchmarking, leveraging FP8, AWQ, GPTQ, and performance tools such as GuideLLM and Locust.
Implement observability and ML monitoring using Prometheus, Grafana, and Arize AI to ensure SLA/SLO compliance for GenAI services.
Collaborate with ML and research teams to onboard new models, tune inference performance, and productionize GenAI use cases.