Principal ML Data Platform Engineer
Listed on 2026-02-16
IT/Tech
Data Engineer, Machine Learning/ML Engineer
Senior/Principal Backend Engineer, ML Data Platform (0→1 Build)
This company is an emerging AI and data-infrastructure startup building a next-generation platform for processing large volumes of sensitive, unstructured data. They are hiring a Senior/Principal Backend Engineer to build their machine learning data infrastructure from the ground up.
This is a highly autonomous, high-ownership role for an engineer who thrives in ambiguity and can independently architect systems, build pipelines, and design ML experiment frameworks without depending on an existing data science team. The engineering culture requires in-person collaboration in San Francisco four days per week.
THE ROLE
You will be responsible for creating the entire ML data and experimentation platform, including systems for model evaluation, versioning, data ingestion, and large‑scale processing. The work spans backend engineering, ML evaluation frameworks, and data‑pipeline architecture.
KEY RESPONSIBILITIES
- Build end‑to‑end evaluation pipelines for NLP and classification models
- Design frameworks for experiment tracking, rapid model iteration, and A/B testing
- Architect data flows across databases, cloud storage, and distributed compute environments
- Create reproducible ML pipelines that function in both cloud and on‑prem setups
- Build tooling for ingesting and processing diverse unstructured data, including text, transcripts, and PDFs
- Establish foundational MLOps practices and model‑performance benchmarking
- Own the full pipeline from raw data ingestion through dataset generation
KEY CHALLENGES
- Standing up ML infrastructure from scratch
- Developing evaluation systems for NER and classification models
- Bridging structured databases with large data‑lake environments
- Optimizing distributed compute jobs across Spark, Databricks, and on‑prem clusters
- Scaling pipelines to very large data volumes
- Operating without a staffed data science function
REQUIREMENTS
- 5+ years backend engineering experience with deep data‑pipeline exposure
- Significant Spark experience (preferably PySpark) and familiarity with hybrid cloud and on‑prem environments
- Ability to design ML experiments and evaluate model performance
- Strong Python skills and comfort with ML toolkits
- Experience with PostgreSQL, S3/Parquet, and distributed batch processing
- NER/NLP understanding and prior ML‑infrastructure experience
- Bonus: exposure to audio or document‑processing pipelines
TECH ENVIRONMENT
- Spark (cloud + on‑prem)
- PostgreSQL
- S3‑based data lakes
- Batch processing workflows
- NLP and classification model evaluation at scale
EXPECTED OUTCOMES
- Production‑ready evaluation pipelines within 90 days
- Reliable experiment‑tracking system that accelerates model‑performance iteration
- Scalable data infrastructure capable of supporting high‑volume workloads
- Faster model‑improvement cycles through effective sampling and evaluation design
WHY JOIN
- You are the founding owner of ML infrastructure
- Massive ownership across architecture, systems, and experimentation
- Direct, measurable impact on model quality and platform capabilities
- Opportunity to define best practices, standards, and systems from day one