Principal Engineer - Perf and Benchmarking
Job in
Bellevue, King County, Washington, 98009, USA
Listed on 2026-06-01
Listing for:
CoreWeave
Full Time position
Job specializations:
- IT/Tech
Systems Engineer, Data Engineer, AI Engineer, Data Science Manager
Job Description & How to Apply Below
Founded in 2017, CoreWeave became a publicly traded company (Nasdaq: CRWV) in March 2025. Learn more at
About this role
We're looking for a Principal Engineer to be the technical lead of CoreWeave's Benchmarking & Performance team. You will be responsible for our planet-scale performance data warehouse: ingesting, storing, transforming, and analyzing performance events from every data center across our global infrastructure.
You will also be integral to our industry-leading, end-to-end performance benchmarking publications. If MLPerf (Training & Inference), working closely with NVIDIA (Megatron-LM, TensorRT-LLM & DGX Cloud), and the open-source community (llm-d, vLLM, and all popular ML frameworks) speak to you, come help us demonstrate CoreWeave's performance and reliability leadership in the field.
What you'll do
* Strategy & Leadership - Define the multi-year benchmarking strategy and roadmap; prioritize models/workloads (LLMs, diffusion, vision, speech) and hardware tiers. Build, lead, and mentor a high-performing team of performance engineers and data analysts. Establish governance for claims: documented methodologies, versioning, reproducibility, and audit trails.
* Perf Ownership - Lead end-to-end MLPerf Inference and Training submissions: workload selection, cluster planning, runbooks, audits, and result publication. Coordinate optimization tracks with NVIDIA (CUDA, cuDNN, TensorRT/TensorRT-LLM, Triton, NCCL) to hit competitive results; drive upstream fixes where needed.
* Internal Latency & Throughput Benchmarks - Design a Kubernetes-native, repeatable benchmarking service that exercises CoreWeave stacks across SUNK (Slurm on Kubernetes), Kueue, and Kubeflow pipelines. Measure and report p50/p95/p99 latency, jitter, tokens/s, time-to-first-token, cold-start/warm-start, and cost-per-token/request across models, precisions (BF16/FP8/FP4), batch sizes, and GPU types. Maintain a corpus of representative scenarios (streaming, batch, multi-tenant) and datasets; automate comparisons across software releases and hardware generations.
* Tooling & Automation - Build CI/CD pipelines and K8s controllers/operators to schedule benchmarks at scale; integrate with observability stacks (Prometheus, Grafana, OpenTelemetry) and results warehouses. Implement supply-chain integrity for benchmark artifacts (SBOMs, Cosign signatures).
* Cross-functional & Community - Partner with NVIDIA, key ISVs, and OSS projects (vLLM, Triton, KServe, PyTorch/DeepSpeed, ONNX Runtime) to co-develop optimizations and upstream improvements. Support Sales/SEs with authoritative numbers for RFPs and competitive evaluations; brief analysts and press with rigorous, defensible data.
Who you are
* 10+ years building distributed systems or HPC/cloud services, with deep expertise in large-scale ML training or similar high-performance workloads.
* Proven track record of architecting or building planet-scale data systems (e.g., telemetry platforms, observability stacks, cloud data warehouses, large-scale OLAP engines).
* Deep understanding of GPU performance (CUDA, NCCL, RDMA, NVLink/PCIe, memory bandwidth), model-server stacks (Triton, vLLM, TensorRT-LLM, TorchServe), and distributed training frameworks (PyTorch FSDP/DeepSpeed/Megatron-LM).
* Proficient with Kubernetes and ML control planes; familiarity with SUNK, Kueue, and Kubeflow in production environments.
* Excellent communicator able to interface with executives, customers, auditors, and OSS communities.
Nice to have
* Experience with time-series databases, log-structured merge trees (LSM), or custom storage engine development.
* Experience running MLPerf submissions (Inference and/or Training) or equivalent audited benchmarks at scale.
* Contributions to MLPerf, Triton, vLLM, PyTorch, KServe, or similar OSS projects.
* Experience benchmarking multi-region fleets and large clusters (thousands of GPUs).
* Publications/talks on ML performance, latency engineering, or large-scale benchmarking methodology.
The base salary range for this role is $206,000 to $333,000. The starting salary will be determined based on job-related knowledge, skills, experience, and market location. We strive for both market alignment and internal equity when determining compensation. In addition to base salary, our total rewards package includes a discretionary bonus, equity awards, and a comprehensive benefits program (all based on eligibility).
What We Offer
The range we've posted represents the typical compensation range for this role. To determine actual compensation, we review the market rate for each candidate…