Research Engineer: Spatial Perception and Reasoning (Engineering), San Jose area, California, USA

Position: Research Engineer: Spatial Perception and Reasoning …

Job Number: P25 F03

Honda Research Institute USA (HRI-US) is seeking a Research Engineer to advance multimodal world modeling and spatial intelligence for real-world AI systems. This is a hands‑on engineering role focused on developing and scaling robust learning systems that understand dynamic 3D environments, integrate vision, language, and temporal reasoning, and support predictive and adaptive behavior in embodied AI systems. The successful candidate will contribute to methods for spatial perception, geometric reasoning, scene understanding, video‑based reasoning, and multimodal representation learning in complex real-world environments.

Location: San Jose, CA

Key Responsibilities

Develop models for 3D spatial perception, geometric reasoning, and dynamic scene understanding in real‑world environments.
Design and prototype multimodal learning systems that integrate vision, language, video, and temporal signals for spatial reasoning.
Build robust scene understanding systems capable of handling long‑tail, ambiguous, and edge‑case scenarios using large‑scale data, simulation, or generative approaches.
Develop world‑modeling and predictive‑reasoning methods, including learning‑based dynamics models, video prediction, and imagination‑driven planning.
Investigate multimodal representation learning approaches that align spatial, visual, linguistic, and temporal information.
Train, fine‑tune, evaluate, and optimize multimodal models such as VLMs, MLLMs, video‑language models, or related architectures.
Conduct benchmarking, error analysis, and experimental evaluations to improve robustness, generalization, and real‑world performance.
Collaborate with research teams to develop prototypes, publications, patents, and technical innovations.

Minimum Qualifications

Master’s degree or Ph.D. in Computer Science, Electrical Engineering, Robotics, Machine Learning, or a related field.
Strong experience designing and developing multimodal models, including VLMs, MLLMs, video‑language models, or related architectures.
Solid foundation in 3D spatial perception and geometric reasoning, with the ability to model spatial relationships in dynamic environments.
Experience building robust scene understanding systems for real‑world or simulated environments.
Familiarity with world models or predictive reasoning methods, such as learning‑based dynamics models, video prediction, or planning‑oriented representations.
Experience with multimodal representation learning that integrates vision, language, and temporal signals.
Proficiency in Python and modern deep learning frameworks, with the ability to rapidly prototype research ideas and systems.
Strong communication, presentation, and collaboration skills.

Bonus Qualifications

Experience with vision‑text embedding alignment, vision and language encoders, adapters, MLLM training stages, or multimodal fine‑tuning pipelines.
Knowledge of generative models such as Variational Autoencoders, Diffusion Models, or Generative Adversarial Networks.
Experience with action understanding tasks such as action segmentation, temporal alignment, action anticipation, or activity recognition.
Publications or research contributions in leading AI, machine learning, computer vision, or robotics venues such as CVPR, ICCV, ECCV, NeurIPS, ICLR, AAAI, RSS, CoRL, or ICRA.

Desired Start Date: 9/14/2026
