Research Scientist - Multimodal Representation Learning
Location: Fremont area, California, USA
Category: Engineering

Focus

Multimodal Foundation Models · Representation Learning · Method Innovation

We are looking for strong technical builders and researchers whose understanding of foundation models and representation learning goes well beyond applying existing frameworks.

Ideal candidates should have:

Strong experimental rigor
Solid systems and modeling intuition
Hands‑on engineering ability
Interest in scalable multimodal AI systems for real‑world autonomy

We value people who can bridge research and production, and who care about robustness, scalability, efficiency, and practical deployment in large‑scale autonomous driving systems.

Responsibilities
1. Large‑Scale Foundation Model Pretraining

Develop scalable pretraining pipelines for large‑scale multimodal driving data
Design and optimize training strategies for:
- Vision‑language‑action models
- Video foundation models
- Long‑context temporal modeling
- Multimodal representation alignment
Improve:
- Training stability
- Data efficiency
- Scaling efficiency
- Representation robustness
Work on distributed training systems and large‑scale model optimization using frameworks such as:
- PyTorch Distributed
- DeepSpeed
- Megatron‑LM

2. Representation Learning & Method Innovation

Design and improve self‑supervised and multimodal learning methods for real‑world autonomous driving systems
Conduct architecture‑level research on:
- Vision Transformers (ViT)
- Video / temporal architectures
- Multimodal fusion and alignment
- Embedding and retrieval systems
- Long‑context and memory‑efficient architectures
Explore and improve:
- Pretraining objectives
- Loss functions
- Training paradigms
- Generalization and robustness
Analyze model behavior through:
- Rigorous ablation studies
- Failure case analysis
- Representation probing and evaluation

3. Efficient Foundation Models & Scalable Deployment

Improve the efficiency, scalability, and deployability of large multimodal foundation models for real‑world autonomous driving systems
Work on areas such as:
- Model quantization
- Knowledge distillation
- Efficient attention mechanisms
- Sparse architectures and Mixture‑of‑Experts (MoE)
- Long‑context and memory‑efficient modeling
- Inference acceleration and serving optimization
- Training and inference system efficiency
Optimize model throughput, latency, memory usage, and deployment performance for large‑scale production environments

Qualifications

MS or PhD in:

Computer Vision
Machine Learning
Robotics
Computer Science
Related fields

Strong understanding of:

Foundation models
Self‑supervised learning
Representation learning
Multimodal learning
Large‑scale pretraining

Hands‑on experience with methods such as:

CLIP
DINO / DINOv2
MAE
Contrastive learning
Masked modeling
MoE or scalable transformer architectures

Experience with one or more of the following is highly valued:

Video foundation models
Long‑context modeling
Retrieval systems
Efficient inference
Distributed training
Model compression and deployment optimization

A strong publication record in top‑tier venues is preferred:

CVPR
ICCV
ECCV
NeurIPS
ICLR
ICML
