Architect, AI Cloud Platform

Job in Campbell, Santa Clara County, California, 95011, USA
Listing for: Oxmiq Labs
Full Time position
Listed on 2026-05-15
Job specializations:
  • Software Development
    DevOps, Software Engineer, Cloud Engineer - Software, Software Architect
Salary/Wage Range or Industry Benchmark: 80,000 - 100,000 USD yearly
Job Description & How to Apply Below

OXMIQ designs GPU and AI silicon for large‑scale model inference and training and is developing an infrastructure and AI service orchestration platform that runs on heterogeneous accelerator hardware.

The Role

The Architect, AI Cloud Platform, owns the inference‑serving architecture of OXMIQ's infrastructure and AI service orchestration platform, the layer through which customer workloads are served from accelerator fleets. The role is responsible for the end‑to‑end serving path: how a model is loaded, scheduled, batched, cached, dispatched, and routed across heterogeneous hardware to deliver competitive latency, throughput, and token‑per‑dollar economics.

The Architect must also have a working understanding of the broader platform layers on which inference serving depends — Kubernetes‑based orchestration, multi‑tenant isolation, observability, billing, and DC‑scale provisioning — and will collaborate with the engineering teams that own those layers to deliver a performant, integrated solution. Background and hands‑on experience with these layers is expected; ownership of their delivery is not.

The role is hands‑on. The Architect produces design documents, prototypes critical components, leads technical reviews, and works directly with engineering leads across each layer of the stack. The Architect also serves as a technical point of contact in selected customer and partner engagements.

Key Responsibilities
  • Own the inference‑serving architecture end to end, including model loading, continuous batching, KV‑cache management, prefix caching, request routing, and SLA‑aware scheduling across heterogeneous accelerators.
  • Lead the design of disaggregated prefill/decode deployments, including KV‑cache transfer (e.g., NIXL over RDMA / InfiniBand / RoCE), KV‑cache‑aware request routing, and the orchestration patterns required to operate them at scale.
  • Define the integration model between OXMIQ's Capsule runtime and the open‑source inference‑serving stack (vLLM, SGLang, TensorRT‑LLM, llm‑d, NVIDIA Dynamo, Triton Inference Server) so that serving workloads dispatch across heterogeneous silicon as a first‑class capability.
  • Partner with the orchestration team on the design of Kubernetes‑based scheduling for accelerator fleets, including multi‑tenant isolation, GPU and accelerator scheduling, and capacity management, ensuring it meets the needs of the inference‑serving layer.
  • Partner with the data‑center infrastructure team on DC‑scale provisioning, OS imaging, firmware, and burn‑in validation flows for AI pods running on OXMIQ and third‑party hardware, ensuring inference SLAs are achievable on the resulting fleet.
  • Conduct architecture and code reviews and provide technical guidance to engineering leads across inference, orchestration, runtime, security, monitoring, and platform UI.
  • Produce design documents, prototypes, and reference implementations for new platform components.
  • Serve as the technical representative of the platform architecture in selected customer and partner engagements.
Required Qualifications
  • 10+ years of platform, infrastructure, or cloud software engineering experience, with at least several years at a Principal Engineer or Architect level owning a multi‑component platform.
  • Deep, hands‑on experience with modern inference‑serving systems (vLLM, SGLang, TensorRT‑LLM, Triton Inference Server, or comparable) at the level of operating them in production, modifying them, and understanding their internals.
  • Working knowledge of LLM serving optimizations: continuous batching, PagedAttention, prefix caching, KV‑cache management, speculative decoding, and quantization (FP8, INT8, or comparable) for inference.
  • Hands‑on experience with disaggregated prefill/decode architectures, including KV‑cache transfer mechanisms (NIXL, RDMA over InfiniBand or RoCE), KV‑cache‑aware request routing, and the operational considerations of running disaggregated serving at scale.
  • Deep experience with Kubernetes at production scale: operators, CRDs, scheduling, multi‑tenancy, GPU and accelerator scheduling, and the operational realities of running it. Familiarity with Kubernetes‑native inference frameworks (llm‑d, NVIDIA…