Applied Scientist – Vision Language Models; Multimodal Reasoning Job, San Francisco area, California, USA, Engineering

Position: Applied Scientist – Vision Language Models (Multimodal Reasoning)

Ready to build VLMs that go beyond captioning and simple grounding?

This role is centred on advancing vision-language models that power intelligent agents operating in complex, real-world environments. The focus is firmly on multimodal model design, training, and post-training, with some computer vision work mixed in.

As an Applied Scientist, you’ll work on large multimodal models that integrate visual inputs with language-based reasoning. You’ll explore how VLMs can move from recognition and description toward structured understanding, task execution, and agentic decision-making.

Your work will include designing model architectures, improving cross-modal alignment, and developing post-training strategies that strengthen reasoning, factual consistency, and controllability. You’ll contribute across the full lifecycle, from data curation and supervised fine-tuning through to preference optimisation and evaluation.

This is a research-heavy role with clear production impact. You’ll prototype new ideas, run rigorous experiments, and collaborate with engineering teams to deploy models into live agent workflows.

Your focus will include:

Training and fine-tuning large-scale vision-language models
Improving multimodal alignment between image and text representations
Applying post-training techniques such as SFT, RLHF, DPO, and reward modelling
Designing evaluation frameworks for reasoning quality, grounding accuracy, and robustness
Working with large multimodal datasets, including synthetic and proprietary data

Hands‑on work with VLMs or multimodal foundation models is essential. Experience in post‑training, alignment, or preference learning is highly valued.

A solid understanding of how to evaluate multimodal systems, including hallucination, grounding failures, and reasoning gaps, is important. You should be comfortable reading and implementing recent research, and designing experiments that move models forward in measurable ways.

You’ll have ownership over modelling decisions and the opportunity to influence how multimodal intelligence is shaped within a fast‑growing AI team.

Compensation: $200,000 - $320,000 base (negotiable depending on level) + bonus + meaningful equity + benefits

Location: SF Bay Area or Miami (Hybrid). Remote flexibility in the short term.

If you’re motivated by pushing vision-language models toward deeper reasoning and real-world capability, we’d like to speak with you!
