Inference Engine Development - Member of Technical Staff, Engineering (Greater London, England, UK)

Location: Greater London

About Us

Artificial intelligence scaled on a bet - that bigger models, more identical chips, and more data would keep delivering. As problems grow more complex and the requirements of intelligence more diverse, that bet is breaking down. The next era belongs to heterogeneous intelligence: diverse models on diverse chips, each with distinct strengths, co-evolving into systems of capability unreachable by any single model or accelerator.

Callosum is the Intelligent Systems company. We built the infrastructure that makes heterogeneous intelligence possible. Our co-evolution engine optimises simultaneously across workflows, agents, and silicon. We launched in early 2026 showing orders-of-magnitude improvements in performance and a shift in the cost-performance frontier that no single chip or model provider can match.

We believe intelligence comes from the system, not the model.

We are scientists and engineers solving what others consider impossible. If you thrive on hard problems and are energised by the scale of the challenge, we'd love to hear from you.

About the Role

Callosum believes that orders-of-magnitude improvements in AI systems will come through application-aware orchestration across heterogeneous hardware. We are building that vision: infrastructure that treats the full landscape of compute, not just GPUs, as a unified, co-evolving system.

Inference engines were designed for single-model inference on homogeneous GPU clusters - this role takes them beyond that. Working directly on systems like vLLM and SGLang, you will adapt and extend them for heterogeneous resources, making them hardware-aware, with deeper optimisation around scheduling, memory, and execution. The execution strategies you design - parallelism, disaggregation, caching - will define what heterogeneous inference looks like at production scale.

Your work ensures that the capabilities exposed by the lower layers of the stack translate into real, measurable gains and set the new standard for how inference runs on diverse hardware.

What You'll Build

Contribute upstream to SGLang and vLLM, and maintain internal forks where our requirements diverge
Improve hardware-awareness within inference engines so that scheduling, memory management, and execution adapt to the capabilities of the underlying accelerator
Design and implement bespoke parallelism and disaggregation strategies that go beyond default configurations to better exploit heterogeneous hardware
Work closely with an Accelerator Systems Software engineer to ensure engine-level abstractions map cleanly onto diverse hardware capabilities

What You Bring

Deep familiarity with the internals of SGLang, vLLM, or comparable inference serving frameworks - scheduler design, memory management, and execution pipelines
Strong background in high-performance Python and C++/CUDA systems, particularly in the context of ML inference
Experience designing or implementing parallelism strategies for large model serving
Understanding of disaggregated serving architectures and the tradeoffs involved in separating modules of a workflow
Demonstrable record of working effectively in fast-moving open source codebases with evolving APIs and design conventions
