Senior Data & MLOps Engineer, London area, Greater London, England, UK (Software Development)

Location: Greater London

CoreWeave is The Essential Cloud for AI™. Built for pioneers by pioneers, CoreWeave delivers a platform of technology, tools, and teams that enables innovators to build and scale AI with confidence. Trusted by leading AI labs, startups, and global enterprises, CoreWeave combines superior infrastructure performance with deep technical expertise to accelerate breakthroughs and turn compute into capability. Founded in 2017, CoreWeave became a publicly traded company (Nasdaq: CRWV) in March 2025.

Learn more at

We’re proud to be a Living Wage accredited Employer.

What You’ll Do:

The Data Science team is focused on developing an advanced reliability platform. This system covers various aspects of data processing and analysis, including data intake, deriving meaningful metrics, identifying unusual patterns, predicting potential issues, finding slow processes in distributed systems, and using automated analysis to determine causes. We collaborate closely with internal teams like Fleet, Infrastructure, and AI Platform to enhance system stability, optimize resource use, shorten resolution times, and maintain service availability and financial performance.

About the role:

As a Senior Data & MLOps Engineer, you will design and scale the infrastructure supporting the GPU Intelligence Platform. This involves building pipelines for data, features, and model training, and delivering insights and predictions for system health and optimization. You will take the system from initial prototypes to a production environment operating across the fleet, focusing on scalability, the separation of real‑time serving from periodic processing, and dynamic resource management based on system load and data frequency.

You will architect and deploy these scalable distributed services using orchestration technologies.

Key responsibilities:

Design and implement scalable data ingestion pipelines.
Build feature processing and baseline computation systems.
Productionize models for prediction and detection.
Develop and operate low‑latency services and robust offline workflows.
Architect horizontally scalable services with clear separation between components, leveraging orchestration for distribution.
Implement monitoring and feedback loops for continuous model and signal improvement.
Collaborate with Platform teams to integrate operational signals into monitoring and diagnostics.
Implement a scalable solution for mitigation and structured analysis.

Who You Are:

7+ years of experience in data engineering, distributed systems, MLOps, or infrastructure ML roles in production environments.
Proven experience building high-throughput streaming or telemetry pipelines (e.g., Kafka, Pulsar, Kinesis, or equivalent).
Strong experience designing time‑series feature pipelines and operating large‑scale observability systems.
Experience building and maintaining feature stores and ensuring offline/online feature parity.
Hands‑on experience deploying ML models to production, including versioning, monitoring, rollback, and drift detection.
Experience designing scalable microservices deployed in Kubernetes‑based environments.
Strong proficiency in Python and at least one systems language (Go, Rust, or C++).
Experience working with distributed compute or training systems (e.g., NCCL, PyTorch Distributed, Spark, Ray, Slurm).
Familiarity with GPU telemetry systems such as NVML or DCGM and hardware‑level monitoring concepts.
Demonstrated experience scaling systems from Proof‑of‑Concept to production‑grade, fleet‑level deployments.

Preferred:

Experience working on GPU fleet management, hyperscale infrastructure, or AI training clusters.
Experience building anomaly detection or failure prediction systems for hardware or distributed systems.
Experience implementing distributed straggler detection or collective‑level performance analysis systems.
Experience developing agentic or LLM‑powered reasoning systems for diagnostics or operational intelligence.
Background in reliability engineering or SRE practices.

Wondering if you’re a good fit?

We believe in investing in our people, and value candidates who can bring their own diversified experiences to our teams – even if you aren’t…