
Multimodal Research Engineer: AI Video & Image Gen

Job in San Francisco, San Francisco County, California, 94199, USA
Listing for: Character.ai
Full Time position
Listed on 2026-06-02
Job specializations:
  • IT/Tech
    AI Engineer (Applied/Software)
Salary/Wage Range or Industry Benchmark: 80,000 - 100,000 USD yearly

Requirements

  • Strong passion for pushing the boundaries of visual AI, with a self-driven, hands-on approach to solving complex technical problems
  • Proficient in PyTorch with end-to-end experience across data processing, model training, and deployment
  • Solid understanding of video and image generation architectures, including diffusion models, DiT, ControlNet, and SOTA video generation models
  • Experience with multimodal model training, including working with audio, vision, and language modalities together
  • Experience with distributed training tools (FSDP, DeepSpeed, etc.)
  • Experience with large-scale data processing, dataset construction, and automated data cleaning
  • (Desirable) Experience with joint audio-visual or speech-conditioned generation models
  • (Desirable) Experience with AIGC, video effects, character animation, or asset generation products
  • (Desirable) Familiarity with ML deployment and orchestration (Kubernetes, Slurm, Docker, cloud platforms)
  • (Desirable) Publications in relevant venues (NeurIPS, ICLR, CVPR, ECCV, ICCV, or similar)
What the job involves
  • Joining us as a Research Engineer on the Multimodal team, you'll be at the forefront of building and advancing video and image generation models that bring AI characters to life in entirely new ways
  • Your work will directly shape how millions of users experience rich, expressive, and visually compelling AI interactions every day
  • The Multimodal team is responsible for training, fine-tuning, and deploying cutting-edge image, audio, and video generation models that power Character.AI's visual experiences
  • We work across the full model lifecycle, from data pipelines and training to deployment and product integration
  • As a Multimodal Research Engineer, you will own and advance our video model training efforts, including joint audio-visual generation and image-to-video. You will collaborate across research, product, and infrastructure to push the boundaries of what AI-generated visuals can look and feel like at scale
  • Lead fine-tuning and continued training of video generation models, including image-to-video and joint audio-visual generation
  • Design and experiment with novel model architectures for multimodal generation, including multimodal conditioning (voice, structured text, reference images)
  • Leverage techniques such as LoRA, RLHF, and full-parameter fine-tuning to improve model quality across diverse visual scenarios
  • Design and build large-scale data pipelines and automated annotation workflows to support continuous model improvement
  • Explore model compression, inference acceleration, and serving optimizations to enable efficient real-time video processing at scale