Applied Scientist – Vision Language Models; Multimodal Reasoning

Job in San Francisco, San Francisco County, California, 94199, USA
Listing for: techire ai
Full Time position
Listed on 2026-06-12
Job specializations:
  • Engineering
    AI Evaluation, Artificial Intelligence
  • IT/Tech
    AI Evaluation, Artificial Intelligence
Salary/Wage Range or Industry Benchmark: USD 200,000 - 320,000 per year
Job Description & How to Apply Below
Position: Applied Scientist – Vision Language Models (Multimodal Reasoning)

Ready to build VLMs that go beyond captioning and simple grounding?

This role is centred on advancing vision-language models that power intelligent agents operating in complex, real-world environments. The focus is firmly on multimodal model design, training, and post-training, with some computer vision work in the mix.

As an Applied Scientist, you’ll work on large multimodal models that integrate visual inputs with language-based reasoning. You’ll explore how VLMs can move from recognition and description toward structured understanding, task execution, and agentic decision-making.

Your work will include designing model architectures, improving cross-modal alignment, and developing post-training strategies that strengthen reasoning, factual consistency, and controllability. You’ll contribute across the full lifecycle, from data curation and supervised fine-tuning through to preference optimisation and evaluation.

This is a research-heavy role with clear production impact. You’ll prototype new ideas, run rigorous experiments, and collaborate with engineering teams to deploy models into live agent workflows.

Your focus will include:

  • Training and fine-tuning large-scale vision-language models
  • Improving multimodal alignment between image and text representations
  • Applying post-training techniques such as SFT, RLHF, DPO, and reward modelling
  • Designing evaluation frameworks for reasoning quality, grounding accuracy, and robustness
  • Working with large multimodal datasets, including synthetic and proprietary data

Hands‑on work with VLMs or multimodal foundation models is essential. Experience in post‑training, alignment, or preference learning is highly valued.

A solid understanding of how to evaluate multimodal systems, including hallucination, grounding failures, and reasoning gaps, is important. You should be comfortable reading and implementing recent research, and designing experiments that move models forward in measurable ways.

You’ll have ownership over modelling decisions and the opportunity to influence how multimodal intelligence is shaped within a fast‑growing AI team.

Compensation: $200,000 - $320,000 base (negotiable depending on level) + bonus + meaningful equity + benefits

Location: SF Bay Area or Miami (Hybrid). Remote flexibility in the short term.

If you’re motivated by pushing vision-language models toward deeper reasoning and real-world capability, we’d like to speak with you!
