Machine Learning Engineer - Computer Vision & Multi-Modal AI
Job in
San Francisco, San Francisco County, California, 94103, USA
Listed on 2026-06-27
Listing for:
Unity Technologies
Full Time position
Job specializations:
- Engineering: AI Engineer (Applied/Software)
- IT/Tech: Machine Learning/ML Engineer, AI Engineer (Applied/Software)
Job Description & How to Apply Below
**San Francisco, CA, USA**
**Staff Machine Learning Engineer - Computer Vision & Multi-Modal AI**
Location
San Francisco, CA, USA
Department
AI & Machine Learning
Requisition
JOBREQ-2616040
**Role description**
**The opportunity**
We are building the next generation of AI-driven game experiences - generative world models, neural rendering, and multi-modal understanding that turn images, text, and 3D primitives into interactive worlds. As our Staff Machine Learning Engineer, you will be a core technical leader bringing state-of-the-art computer vision and multi-modal models - transformers, diffusion networks, vision-language models (VLMs), and JEPA-style architectures - from research into robust, production-grade systems.
This is a deeply hands-on, high-impact role. You will help define the modeling and deployment strategy, drive architectural decisions across the ML stack, and mentor a team of senior and mid-level engineers. Your work will directly shape the quality, capability, and performance of AI features experienced by billions of players - across cloud, server, and on-device targets.
**What you'll be doing**
Technical Leadership
+ Help set the technical vision and roadmap for computer vision and multi-modal AI models, spanning transformers, diffusion models, vision-language models, and JEPA-style generative architectures.
+ Drive design and implementation of models for image and video understanding, generation, segmentation, detection, and dense prediction, as well as multi-modal reasoning over images, text, and 3D inputs.
+ Make sound decisions on model architecture, training strategy, data pipelines, and evaluation - balancing quality, capability, latency, and cost across deployment targets.
+ Own the path from research prototype to production: training, fine-tuning, distillation, export, and serving, with deployment spanning cloud GPUs through to efficient on-device inference where the product requires it.
Architecture & Research Translation
+ Collaborate directly with research scientists to translate novel CV and multi-modal model architectures into deployable, well-engineered implementations.
+ Design scalable systems for multi-modal inference that process diverse inputs - images, video, text, primitives, and metadata - and produce rich outputs, from semantic predictions to pixel-level generation.
+ Track and rapidly adopt breakthroughs across the field: vision-language pretraining and alignment, efficient diffusion (e.g., consistency models, flow matching), efficient attention (e.g., Flash Attention, linear-attention variants), and tokenization/representation learning for vision.
+ Where latency or device constraints demand it, apply compression, quantization, pruning, and knowledge distillation, and work with appropriate runtimes (e.g., TensorRT, ONNX Runtime, CoreML, TFLite) to meet performance budgets.
Team & Cross-Functional Leadership
+ Lead and mentor a team of ML engineers; define engineering best practices, code review standards, and rigorous benchmarking and evaluation methodology.
+ Partner with research, platform engineers, product managers, and runtime teams to align ML capabilities with product roadmaps and target-platform constraints.
+ Champion a culture of measurement: define KPIs for model quality, accuracy, latency, memory, and cost, and ensure the team tracks them rigorously.
**What we're looking for**
+ 6+ years in ML engineering, with significant depth in computer vision and/or multi-modal modeling.
+ Proven production experience with transformer-based and diffusion-based vision models (e.g., ViT, CLIP/SigLIP-style encoders, Stable Diffusion, DETR/SAM-style architectures).
+ Strong command of the full model lifecycle: data curation, training and fine-tuning, evaluation, and serving at scale.
+ Familiarity with efficient attention, diffusion samplers, multi-modal fusion, and vision-language alignment techniques.
+ Strong Python and modern deep-learning tooling (PyTorch); solid software engineering fundamentals.
+ Track record of technical leadership: setting direction, influencing cross-functional partners, and growing engineers.
**You might also have**
+ Experience with wo…