Senior AI Data Engineer Job, Menlo Park area, California, USA, IT/Tech

Duration: 7 months (with potential for extension)

As a Senior AI Data Engineer, you will design and operate end‑to‑end pipelines that not only move and transform data but also enrich it using ML models such as classifiers, embedding models, and large language models. The role sits at the intersection of data engineering and ML systems and requires strong systems thinking around throughput, retries, async execution, and capacity management.

You will work closely with engineers and researchers to support image generation and evaluation workflows, contributing directly to data quality, model performance, and scalability.

Required Skills & Experience

Strong data engineering expertise, including advanced SQL, complex query optimization, and production pipeline orchestration (e.g., Airflow or equivalent).
Experience calling inference endpoints, managing batching and throughput, and handling failures and retries at scale.
Experience operating large-scale production pipelines with high reliability and performance requirements.
Proficiency using AI‑assisted coding tools to accelerate development, debugging, and code reviews.
Strong communication skills and ability to collaborate with engineers, researchers, and cross‑functional teams.

Preferred Qualifications

Experience working with embeddings and vector search, including storage, indexing, and similarity queries.
Familiarity with content understanding models, such as image classification, OCR, and safety or quality scoring.
Experience using LLMs for data workflows, including automated annotation, data cleaning, or evaluation tasks.
Knowledge of generative AI systems, particularly image generation and corresponding evaluation metrics.
Background working in data engineering, ML engineering, or hybrid roles that support model training or evaluation.

Responsibilities

AI‑Augmented Data Pipelines:
Design and maintain large‑scale data pipelines (up to billions of records/images) that combine SQL-based transformations with ML model inference for data cleaning, labeling, and enrichment.
Remote Inference Orchestration:
Build and own systems that orchestrate remote model inference within pipelines, including batching, async execution, retries, fallback logic, and graceful degradation under load.
Feature & Embedding Pipelines:
Develop scalable pipelines to generate, store, validate, and serve vector embeddings. Manage nearest‑neighbor indexes and ensure data quality at scale.
Data Curation at Scale:
Source, filter, and curate training datasets using both structured queries and model‑derived signals (e.g., visual quality scores, content classification, safety filters). Own the end‑to‑end data lifecycle with a focus on quality, governance, and compliance.
LLM‑Assisted Annotation:
Design pipelines that use large language models and vision models for automated data annotation. Create auditing workflows to evaluate and improve annotation quality.
Shared Tooling & Frameworks:
Contribute reusable components and frameworks that simplify AI‑augmented data pipelines, such as standardized model‑invocation operators and async job orchestration patterns.
