Principal Engineer, Model Development Platform Job Sunnyvale area, California USA, IT/Tech

Responsibilities

System architecture & reliability - Design and evolve the platform's overall architecture for reliability, observability, and scalability. Set performance, latency, and availability targets, and drive the engineering standards to meet them.
Cross-domain technical leadership - Unify the platform across disciplines, from front-end UIs and distributed training to Spark data pipelines and optimization-based experiment scheduling, ensuring systems interoperate cleanly.
Hands-on problem solving - Dive into the hardest challenges across subteams, lead architectural reviews, and propose pragmatic solutions that balance innovation with operational simplicity.
Experimentation & scheduling systems - Build systems that optimize how models are tested in simulation and on-road, using techniques like linear programming and heuristic optimization to balance hardware, safety, and research priorities while improving throughput and turnaround.
Data & compute infrastructure - Architect pipelines that ingest, transform, and enrich petabytes of fleet sensor data, and drive efficient compute use across GPU, CPU, cloud, and edge for both prototyping and large-scale training.
Strategic collaboration - Partner with Product, Research, and Operations to align architecture with user needs and co-own the platform's long-term roadmap.

About You

Essential

Technical Leadership at Scale – 10+ years of experience designing and building large-scale distributed systems, ML/AI infrastructure, full-stack web applications, or developer platforms, including at least 3 years as a staff or principal-level engineer.
Architectural Depth & Breadth – Proven ability to design systems spanning web platforms, ML pipelines, and large-scale compute orchestration (e.g., Spark, Ray, Kubernetes, Airflow, MLflow).
Reliability and Performance – Experience driving platform reliability improvements, defining SLAs/SLOs, and building self-healing and observable systems that operate at “four nines” availability or better.
Hands-On Systems Design

