On-prem Platform Engineer

Job in Charlotte, Mecklenburg County, North Carolina, 28245, USA
Listing for: TalentOla
Full Time position
Listed on 2026-07-02
Job specializations:
  • Software Development
    AI Engineer (Applied/Software), Machine Learning/ML Engineer, AI Reliability/Performance Engineer
Salary/Wage Range or Industry Benchmark: 100,000 - 130,000 USD yearly
Job Description & How to Apply Below
Position: On-prem Platform Engineer

Role: On-prem Platform Engineer

Location: Charlotte, NC

Key Skills

Must-Have Skills (Mandatory Keywords)

LLM Inference & Optimization
  • vLLM, TensorRT-LLM, Triton Inference Server, SGLang
  • Inference optimization techniques
    • Continuous batching
    • Speculative decoding
    • KV cache / Prefix caching
  • Model optimization
    • FP8, AWQ, GPTQ
Distributed & GPU Systems
  • Tensor parallelism and large model scaling
  • CUDA, NCCL, GPU architecture
  • GPU partitioning & optimization (MIG)
Kubernetes & ML Serving
  • Kubernetes-based ML serving platforms
  • KServe, OpenShift AI
  • Helm charts, Operators, platform automation
GPU Orchestration
  • Run:AI or similar GPU scheduling/orchestration platforms
  • Multi-tenant GPU workload management
Platform Engineering
  • Experience building internal AI/ML platforms (on-prem or hybrid)
  • Strong automation and system design mindset
Observability & Performance
  • Prometheus, Grafana
  • ML observability (model latency, throughput, drift, resource utilization)
  • Performance benchmarking and tuning
Good to Have / Preferred Skills
  • Experience with LLMOps / GenAI pipelines
  • Exposure to hybrid cloud (on-prem + GCP/Azure integration)
  • Familiarity with Inferentia / alternative accelerators
  • Knowledge of service mesh / networking in GPU clusters

Build, configure, and operate on-prem Kubernetes/OpenShift AI platforms for deploying and serving GenAI models and LLM inference workloads.

Design and optimize high-performance inference stacks using vLLM, TensorRT-LLM, Triton Inference Server, SGLang, and advanced techniques (continuous batching, speculative decoding, KV caching).

Manage GPU orchestration and capacity using Run:AI, MIG, CUDA/NCCL, and tensor parallelism to maximize utilization and throughput.

Deploy and operate Kubernetes ML serving frameworks (KServe, Helm, Operators) for scalable, reliable model serving.

Drive inference optimization and benchmarking, leveraging FP8, AWQ, GPTQ, and performance tools such as GuideLLM and Locust.

Implement observability and ML monitoring using Prometheus, Grafana, and Arize AI, ensuring SLA/SLO compliance for GenAI services.

Collaborate with ML and research teams to onboard new models, tune inference performance, and productionize GenAI use cases.
