Machine Learning Engineer - Computer Vision & Multi-Modal AI

Job in San Francisco, San Francisco County, California, 94103, USA
Listing for: Unity Technologies
Full Time position
Listed on 2026-06-27
Job specializations:
  • Engineering
    AI Engineer (Applied/Software)
  • IT/Tech
    Machine Learning/ML Engineer, AI Engineer (Applied/Software)
Job Description & How to Apply Below
Position: Staff Machine Learning Engineer - Computer Vision & Multi-Modal AI

Location

San Francisco, CA, USA

Department

AI & Machine Learning

Requisition

JOBREQ-2616040

Role description

The opportunity

We are building the next generation of AI-driven game experiences - generative world models, neural rendering, and multi-modal understanding that turn images, text, and 3D primitives into interactive worlds. As our Staff Machine Learning Engineer, you will be a core technical leader bringing state-of-the-art computer vision and multi-modal models - transformers, diffusion networks, vision-language models (VLMs), and JEPA-style architectures - from research into robust, production-grade systems.

This is a deeply hands-on, high-impact role. You will help define the modeling and deployment strategy, drive architectural decisions across the ML stack, and mentor a team of senior and mid-level engineers. Your work will directly shape the quality, capability, and performance of AI features experienced by billions of players - across cloud, server, and on-device targets.

What you'll be doing

Technical Leadership

+ Help set the technical vision and roadmap for computer vision and multi-modal AI models, spanning transformers, diffusion models, vision-language models, and JEPA-style generative architectures.

+ Drive design and implementation of models for image and video understanding, generation, segmentation, detection, and dense prediction, as well as multi-modal reasoning over images, text, and 3D inputs.

+ Make sound decisions on model architecture, training strategy, data pipelines, and evaluation - balancing quality, capability, latency, and cost across deployment targets.

+ Own the path from research prototype to production: training, fine-tuning, distillation, export, and serving, with deployment spanning cloud GPUs through to efficient on-device inference where the product requires it.

Architecture & Research Translation

+ Collaborate directly with research scientists to translate novel CV and multi-modal model architectures into deployable, well-engineered implementations.

+ Design scalable systems for multi-modal inference that process diverse inputs - images, video, text, primitives, and metadata - and produce rich outputs, from semantic predictions to pixel-level generation.

+ Track and rapidly adopt breakthroughs across the field: vision-language pretraining and alignment, efficient diffusion (e.g., consistency models, flow matching), efficient attention (e.g., Flash Attention, linear-attention variants), and tokenization/representation learning for vision.

+ Where latency or device constraints demand it, apply compression, quantization, pruning, and knowledge distillation, and work with appropriate runtimes (e.g., TensorRT, ONNX Runtime, CoreML, TFLite) to meet performance budgets.

Team & Cross-Functional Leadership

+ Lead and mentor a team of ML engineers; define engineering best practices, code review standards, and rigorous benchmarking and evaluation methodology.

+ Partner with research, platform engineers, product managers, and runtime teams to align ML capabilities with product roadmaps and target-platform constraints.

+ Champion a culture of measurement: define KPIs for model quality, accuracy, latency, memory, and cost, and ensure the team tracks them rigorously.

What we're looking for

+ 6+ years in ML engineering, with significant depth in computer vision and/or multi-modal modeling.

+ Proven production experience with transformer-based and diffusion-based vision models (e.g., ViT, CLIP/SigLIP-style encoders, Stable Diffusion, DETR/SAM-style architectures).

+ Strong command of the full model lifecycle: data curation, training and fine-tuning, evaluation, and serving at scale.

+ Familiarity with efficient attention, diffusion samplers, multi-modal fusion, and vision-language alignment techniques.

+ Strong Python and modern deep-learning tooling (PyTorch); solid software engineering fundamentals.

+ Track record of technical leadership: setting direction, influencing cross-functional partners, and growing engineers.

You might also have

+ Experience with wo…