Architect, AI Cloud Platform Job: Campbell area, California, USA (Software Development)

OXMIQ designs GPU and AI silicon for large‑scale model inference and training and is developing an infrastructure and AI service orchestration platform that runs on heterogeneous accelerator hardware.

The Role

The Architect, AI Cloud Platform, owns the inference‑serving architecture of OXMIQ's infrastructure and AI service orchestration platform, the layer through which customer workloads are served from accelerator fleets. The role is responsible for the end‑to‑end serving path: how a model is loaded, scheduled, batched, cached, dispatched, and routed across heterogeneous hardware to deliver competitive latency, throughput, and tokens‑per‑dollar economics.

The Architect must also have a working understanding of the broader platform layers on which inference serving depends — Kubernetes‑based orchestration, multi‑tenant isolation, observability, billing, and DC‑scale provisioning — and will collaborate with the engineering teams that own those layers to deliver a performant, integrated solution. Background and hands‑on experience with these layers is expected; ownership of their delivery is not.

The role is hands‑on. The Architect produces design documents, prototypes critical components, leads technical reviews, and works directly with engineering leads across each layer of the stack. The Architect also serves as a technical point of contact in selected customer and partner engagements.

Key Responsibilities

Own the inference‑serving architecture end to end, including model loading, continuous batching, KV‑cache management, prefix caching, request routing, and SLA‑aware scheduling across heterogeneous accelerators.
Lead the design of disaggregated prefill/decode deployments, including KV‑cache transfer (e.g., NIXL over RDMA / InfiniBand / RoCE), KV‑cache‑aware request routing, and the orchestration patterns required to operate them at scale.
Define the integration model between OXMIQ's Capsule runtime and the open‑source inference‑serving stack (vLLM, SGLang, TensorRT‑LLM, llm‑d, NVIDIA Dynamo, Triton Inference Server) so that serving workloads dispatch across heterogeneous silicon as a first‑class capability.
Partner with the orchestration team on the design of Kubernetes‑based scheduling for accelerator fleets, including multi‑tenant isolation, GPU and accelerator scheduling, and capacity management, ensuring it meets the needs of the inference‑serving layer.
Partner with the data‑center infrastructure team on DC‑scale provisioning, OS imaging, firmware, and burn‑in validation flows for AI pods running on OXMIQ and third‑party hardware, ensuring inference SLAs are achievable on the resulting fleet.
Conduct architecture and code reviews and provide technical guidance to engineering leads across inference, orchestration, runtime, security, monitoring, and platform UI.
Produce design documents, prototypes, and reference implementations for new platform components.
Serve as the technical representative of the platform architecture in selected customer and partner engagements.

Required Qualifications

10+ years of platform, infrastructure, or cloud software engineering experience, with at least several years at a Principal Engineer or Architect level owning a multi‑component platform.
Deep, hands‑on experience with modern inference‑serving systems (vLLM, SGLang, TensorRT‑LLM, Triton Inference Server, or comparable) at the level of operating them in production, modifying them, and understanding their internals.
Working knowledge of LLM serving optimizations: continuous batching, PagedAttention, prefix caching, KV‑cache management, speculative decoding, and quantization (FP8, INT8, or comparable) for inference.
Hands‑on experience with disaggregated prefill/decode architectures, including KV‑cache transfer mechanisms (NIXL, RDMA over InfiniBand or RoCE), KV‑cache‑aware request routing, and the operational considerations of running disaggregated serving at scale.
Deep experience with Kubernetes at production scale: operators, CRDs, scheduling, multi‑tenancy, GPU and accelerator scheduling, and the operational realities of running it. Familiarity with Kubernetes‑native inference frameworks (llm‑d, NVIDIA…