Principal Engineer - Perf and Benchmarking Job Bellevue area, Washington, USA, IT/Tech

CoreWeave is The Essential Cloud for AI. Built for pioneers by pioneers, CoreWeave delivers a platform of technology, tools, and teams that enables innovators to build and scale AI with confidence. Trusted by leading AI labs, startups, and global enterprises, CoreWeave combines superior infrastructure performance with deep technical expertise to accelerate breakthroughs and turn compute into capability.

Founded in 2017, CoreWeave became a publicly traded company (Nasdaq: CRWV) in March 2025. Learn more at

About this role

We're looking for a Principal Engineer to be the technical lead of CoreWeave's Benchmarking & Performance team. You will be responsible for our planet-scale performance data warehouse: ingesting, storing, transforming, and analyzing performance events from all the data centers across our global infrastructure.

You will also be integral to achieving industry-leading end-to-end performance benchmarking publications. If MLPerf (Training & Inference), working closely with NVIDIA (Megatron-LM, TensorRT-LLM & DGX Cloud), and the open-source community (llm-d, vLLM, and all popular ML frameworks) speak to you, come help us demonstrate CoreWeave's performance and reliability leadership in the field.

What you'll do

* Strategy & Leadership - Define the multi-year benchmarking strategy and roadmap; prioritize models/workloads (LLMs, diffusion, vision, speech) and hardware tiers. Build, lead, and mentor a high-performing team of performance engineers and data analysts. Establish governance for claims: documented methodologies, versioning, reproducibility, and audit trails.

* Perf Ownership - Lead end-to-end MLPerf Inference and Training submissions: workload selection, cluster planning, runbooks, audits, and result publication. Coordinate optimization tracks with NVIDIA (CUDA, cuDNN, TensorRT/TensorRT-LLM, Triton, NCCL) to hit competitive results; drive upstream fixes where needed.

* Internal Latency & Throughput Benchmarks - Design a Kubernetes-native, repeatable benchmarking service that exercises CoreWeave stacks across SUNK (Slurm on Kubernetes), Kueue, and Kubeflow pipelines. Measure and report p50/p95/p99 latency, jitter, tokens/s, time-to-first-token, cold-start/warm-start, and cost-per-token/request across models, precisions (BF16/FP8/FP4), batch sizes, and GPU types. Maintain a corpus of representative scenarios (streaming, batch, multi-tenant) and datasets; automate comparisons across software releases and hardware generations.

* Tooling & Automation - Build CI/CD pipelines and K8s controllers/operators to schedule benchmarks at scale; integrate with observability stacks (Prometheus, Grafana, OpenTelemetry) and results warehouses. Implement supply-chain integrity for benchmark artifacts (SBOMs, Cosign signatures).

* Cross-functional & Community - Partner with NVIDIA, key ISVs, and OSS projects (vLLM, Triton, KServe, PyTorch/DeepSpeed, ONNX Runtime) to co-develop optimizations and upstream improvements. Support Sales/SEs with authoritative numbers for RFPs and competitive evaluations; brief analysts and press with rigorous, defensible data.

Who you are

* 10+ years building distributed systems or HPC/cloud services, with deep expertise in large-scale ML training or similar high-performance workloads.

* Proven track record of architecting or building planet-scale data systems (e.g., telemetry platforms, observability stacks, cloud data warehouses, large-scale OLAP engines).

* Deep understanding of GPU performance (CUDA, NCCL, RDMA, NVLink/PCIe, memory bandwidth), model-server stacks (Triton, vLLM, TensorRT-LLM, TorchServe), and distributed training frameworks (PyTorch FSDP/DeepSpeed/Megatron-LM).

* Proficient with Kubernetes and ML control planes; familiarity with SUNK, Kueue, and Kubeflow in production environments.

* Excellent communicator able to interface with executives, customers, auditors, and OSS communities.

Nice to have

* Experience with time-series databases, log-structured merge trees (LSM), or custom storage engine development.

* Experience running MLPerf submissions (Inference and/or Training) or equivalent audited benchmarks at scale.

* Contributions to MLPerf, Triton, vLLM, PyTorch, KServe, or similar OSS projects.

* Experience benchmarking multi-region fleets and large clusters (thousands of GPUs).

* Publications/talks on ML performance, latency engineering, or large-scale benchmarking methodology.

The base salary range for this role is $206,000 to $333,000. The starting salary will be determined based on job-related knowledge, skills, experience, and market location. We strive for both market alignment and internal equity when determining compensation. In addition to base salary, our total rewards package includes a discretionary bonus, equity awards, and a comprehensive benefits program (all based on eligibility).

What We Offer

The range we've posted represents the typical compensation range for this role. To determine actual compensation, we review the market rate for each candidate…