One Model, Three Modalities: ByteDance Releases Lance for Image and Video Understanding, Generation, and Editing
Building a single model that can both understand and generate images and videos is harder than it sounds. The two tasks pull in opposite directions. Understanding benefits from high-level semantic features tightly aligned with language. Generation needs low-level continuous representations that preserve texture, geometry, and temporal dynamics. Most systems handle this tension by separating the two into distinct architectures, then bridging them post hoc.
ByteDance's research team took a different approach with Lance. Rather than assembling separate components, the team designed a model that natively integrates understanding, generation, and editing across both image and video modalities, trained jointly from the start.

What Lance Can Do
Lance organizes its capabilities into three output families: text (X2T), images (X2I), and videos (X2V). On the understanding side, this covers image and video captioning, visual question answering, OCR, visual grounding, and reasoning. On the generation side, it handles text-to-image, text-to-video, image-to-video, subject-driven generation, image editing, and video editing, including multi-turn consistency editing across both modalities.
This breadth is what sets Lance apart. While typical unified architectures stop at basic image understanding and text-to-image generation, Lance is among the few that natively span both images and videos across understanding and generation tasks.

How the Architecture Works
The architecture rests on two principles: unified context modeling and decoupled capability pathways.
For unified context, Lance converts all inputs (text, images, and videos) into a single shared interleaved multimodal sequence. Text tokens come from the Qwen2.5-VL embedding layer. For understanding-oriented visual inputs, the Qwen2.5-VL ViT encoder produces compact semantic visual tokens. For generation-oriented visual inputs, the Wan2.2 3D causal VAE encoder encodes images and videos into continuous latent representations, applying 16× spatial downsampling and 4× temporal downsampling. All of these heterogeneous token types (text, semantic visual, and latent visual) live in the same sequence. The model then runs generalized 3D causal attention over the full context, with text tokens using causal attention and visual tokens using bidirectional attention.
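The masking rule and the latent downsampling described above can be sketched in plain Python. This is a minimal illustration, not Lance's implementation. The segment representation, the function names `build_attention_mask` and `latent_grid`, the choice to make each visual segment bidirectional within itself while staying causal across segments, and the `1 + (frames - 1) // 4` frame rule (a common convention for causal video VAEs) are all assumptions.

```python
from typing import List, Tuple

Segment = Tuple[str, int]  # (kind, token count); kind is "text" or "visual"


def latent_grid(frames: int, height: int, width: int,
                spatial: int = 16, temporal: int = 4) -> Tuple[int, int, int]:
    """Latent (frames, height, width) after 16x spatial / 4x temporal downsampling.

    Assumption: the first frame is encoded on its own, as causal video VAEs
    commonly do, so an image (frames=1) maps to a single latent frame.
    """
    return (1 + (frames - 1) // temporal, height // spatial, width // spatial)


def build_attention_mask(segments: List[Segment]) -> List[List[bool]]:
    """mask[q][k] is True when query token q may attend to key token k."""
    kinds: List[str] = []
    segment_ids: List[int] = []
    for segment_index, (kind, length) in enumerate(segments):
        if kind not in ("text", "visual"):
            raise ValueError(f"unknown segment kind: {kind}")
        kinds.extend([kind] * length)
        segment_ids.extend([segment_index] * length)

    n = len(kinds)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            causal = k <= q  # every token sees itself and everything before it
            # Visual tokens additionally see later tokens of their own segment.
            same_visual = kinds[q] == "visual" and segment_ids[q] == segment_ids[k]
            mask[q][k] = causal or same_visual
    return mask


if __name__ == "__main__":
    print(latent_grid(frames=81, height=480, width=832))
    for row in build_attention_mask([("text", 2), ("visual", 3), ("text", 1)]):
        print("".join("x" if allowed else "." for allowed in row))
```

In this sketch a text token never sees anything after it, while the three visual tokens see one another in both directions but still cannot see the text that follows them.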
For decoupled pathways, Lance uses a dual-stream mixture-of-experts architecture initialized from Qwen2.5-VL 3B. The understanding expert (LLMUND) handles text and semantic visual tokens, producing outputs for multimodal reasoning and text generation. The generation expert (LLMGEN) handles VAE latent tokens for visual synthesis and editing. Crucially, both experts operate over the same shared interleaved sequence: they share context but do not compete for the same parameters. The understanding expert is trained with a next-token prediction loss; the generation expert is trained with a flow matching objective in continuous latent space. The two losses are combined with configurable weights throughout training.
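A rough sketch of the routing and the combined objective follows, under stated assumptions. The token kind labels (`"text"`, `"vit"`, `"vae"`), all function names, and the default weights are hypothetical. The flow matching pair uses the common linear-interpolation convention (noise at t=0, data at t=1, target velocity `data - noise`), which the article does not confirm as Lance's exact formulation.

```python
from typing import List, Sequence, Tuple

UNDERSTANDING_KINDS = {"text", "vit"}  # text and semantic visual tokens
GENERATION_KINDS = {"vae"}             # continuous latent visual tokens


def route_tokens(token_kinds: Sequence[str]) -> List[str]:
    """Pick an expert per token; every token stays in the same shared sequence."""
    routes = []
    for kind in token_kinds:
        if kind in UNDERSTANDING_KINDS:
            routes.append("und")
        elif kind in GENERATION_KINDS:
            routes.append("gen")
        else:
            raise ValueError(f"unknown token kind: {kind}")
    return routes


def flow_matching_pair(noise: Sequence[float], data: Sequence[float],
                       t: float) -> Tuple[List[float], List[float]]:
    """Return the interpolated latent x_t and the velocity the model should predict."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    if len(noise) != len(data):
        raise ValueError("noise and data must have the same length")
    x_t = [(1.0 - t) * n + t * d for n, d in zip(noise, data)]
    velocity = [d - n for n, d in zip(noise, data)]
    return x_t, velocity


def mean_squared_error(pred: Sequence[float], target: Sequence[float]) -> float:
    """Regression loss between a predicted and a target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


def combined_loss(ntp_loss: float, fm_loss: float,
                  w_und: float = 1.0, w_gen: float = 1.0) -> float:
    """Weighted sum of the next-token prediction and flow matching losses."""
    return w_und * ntp_loss + w_gen * fm_loss


if __name__ == "__main__":
    print(route_tokens(["text", "vit", "vae", "vae"]))
    x_t, target = flow_matching_pair([0.0, 2.0], [1.0, 4.0], t=0.5)
    fm = mean_squared_error([0.5, 2.5], target)  # pretend model prediction
    print(combined_loss(ntp_loss=2.0, fm_loss=fm, w_und=1.0, w_gen=0.5))
```

The point of the split is visible in `route_tokens`: the expert is chosen by token type alone, so both experts see the same context while their weights stay separate.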
Modality-Aware Rotary Positional Encoding (MaPE)
Running ViT semantic tokens, clean VAE condition tokens, and noisy VAE target tokens through the same sequence creates a subtle problem. Standard 3D-RoPE encodes positions based on spatiotemporal layout alone, so it has no way to tell these token groups apart. When multiple visual token groups occupy the same sequence, their positional boundaries become ambiguous, which can hurt cross-task alignment.
Lance introduces Modality-Aware Rotary Positional Encoding (MaPE) to fix this. MaPE applies a fixed temporal offset to each modality group based on its index in the sequence. Spatial coordinates stay unchanged, so the intrinsic layout within images and videos is preserved. The temporal offset alone is enough to separate the token groups in the global positional space without disrupting temporal ordering within any individual video.
Removing MaPE drops GenEval from 80.94 to 80.56, GEdit-Bench from 6.86 to 6.30, and VBench from 81.81 to 80.95: a consistent degradation across image generation, image editing, and video generation.
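The offset idea behind MaPE can be illustrated by computing only the (t, h, w) position indices that a 3D rotary encoding would consume. This is a sketch under assumptions: the function name `mape_positions`, the offset value of 1000, and the rule `offset = group index × fixed step` are hypothetical readings of "a fixed temporal offset based on its index in the sequence", not the paper's exact formula.

```python
from typing import List, Sequence, Tuple

Position = Tuple[int, int, int]  # (temporal, height, width) index


def mape_positions(group_shapes: Sequence[Tuple[int, int, int]],
                   temporal_offset: int = 1000) -> List[List[Position]]:
    """Assign 3D positions to each visual token group.

    group_shapes holds one (frames, height, width) latent shape per group,
    in sequence order. Each group keeps its own spatial grid; only the
    temporal axis is shifted, by group_index * temporal_offset.
    """
    all_positions: List[List[Position]] = []
    for group_index, (frames, height, width) in enumerate(group_shapes):
        shift = group_index * temporal_offset
        group = [
            (t + shift, h, w)
            for t in range(frames)
            for h in range(height)
            for w in range(width)
        ]
        all_positions.append(group)
    return all_positions


if __name__ == "__main__":
    # Hypothetical editing sample: ViT tokens, clean condition latents, noisy targets.
    groups = mape_positions([(1, 2, 2), (1, 2, 2), (3, 2, 2)], temporal_offset=1000)
    for name, group in zip(["vit", "vae_condition", "vae_target"], groups):
        print(name, group[0], group[-1])
```

Without the shift, the first two groups in the usage example would receive identical positions. With it, they occupy disjoint temporal ranges while frame order inside the target video is unchanged.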
Training: Four Stages, One Unified Framework
Lance is trained in four sequential stages, each building on the last.
Pre-Training (PT) lays the foundation using roughly 1B image-text and 140M video-text pairs, covering 1.5T training tokens. This stage establishes basic multimodal alignment and generation capability. The VAE and ViT encoders are frozen here; only the backbone and connectors are trained.
Continual Training (CT) expands the task space by introducing interleaved multi-task data (editing samples, subject-driven generation samples, and multimodal understanding data) across roughly 300B tokens. A progressive data-mixture schedule gradually increases the proportion of harder tasks like editing as training proceeds.
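One simple way to picture a progressive data-mixture schedule is a linear ramp between a starting and an ending mixture. The article only says the share of harder tasks rises gradually, so the linear shape, the task names, and every weight below are illustrative assumptions rather than Lance's actual settings.

```python
from typing import Dict


def mixture_at(progress: float, start: Dict[str, float],
               end: Dict[str, float]) -> Dict[str, float]:
    """Sampling weights per task at a given fraction of training (0.0 to 1.0)."""
    if start.keys() != end.keys():
        raise ValueError("start and end must list the same tasks")
    p = min(max(progress, 0.0), 1.0)  # clamp so out-of-range steps are safe
    blended = {task: (1.0 - p) * start[task] + p * end[task] for task in start}
    total = sum(blended.values())
    if total <= 0.0:
        raise ValueError("mixture weights must sum to a positive value")
    # Normalize so the weights can be used directly as sampling probabilities.
    return {task: weight / total for task, weight in blended.items()}


if __name__ == "__main__":
    # Hypothetical mixtures: editing grows from 10% to 40% of sampled data.
    start = {"generation": 0.6, "understanding": 0.3, "editing": 0.1}
    end = {"generation": 0.4, "understanding": 0.2, "editing": 0.4}
    for progress in (0.0, 0.5, 1.0):
        print(progress, mixture_at(progress, start, end))
```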
Supervised Fine-Tuning (SFT) tightens instruction following, editing accuracy, and identity consistency using curated high-quality data across 72B tokens.
Reinforcement Learning (RL) uses Group Relative Policy Optimization (GRPO), with PaddleOCR serving as the reward model, to further sharpen text rendering accuracy and image-text alignment.
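The core of GRPO is that each sample's advantage is its reward standardized against the other samples generated for the same prompt, so no separate value network is needed. The sketch below shows that step only. The `text_reward` function is a hypothetical stand-in that compares a target string with a string already recognized by an OCR engine. It does not call PaddleOCR, and the article does not say how Lance turns OCR output into a reward.

```python
import difflib
import statistics
from typing import List, Sequence


def text_reward(target: str, recognized: str) -> float:
    """Hypothetical reward in [0, 1]: similarity of recognized text to the target."""
    return difflib.SequenceMatcher(None, target, recognized).ratio()


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one group of samples for the same prompt."""
    if not rewards:
        raise ValueError("rewards must not be empty")
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    # eps keeps the division finite when every sample gets the same reward.
    return [(reward - mean) / (std + eps) for reward in rewards]


if __name__ == "__main__":
    target = "OPEN 24 HOURS"
    # Pretend OCR readings of four images generated for the same prompt.
    readings = ["OPEN 24 HOURS", "OPEN 24 HOUR", "OPFN 2A HOURS", "0PEN"]
    rewards = [text_reward(target, reading) for reading in readings]
    for reading, advantage in zip(readings, group_relative_advantages(rewards)):
        print(f"{reading!r}: {advantage:+.2f}")
```

Samples that render the text better than the group average get a positive advantage and are reinforced; the rest are pushed down.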
Everything fits within a maximum training budget of 128 GPUs.
Results
Image Generation. On GenEval, Lance scores 0.90 overall, tying TUNA for the top spot among unified models. Subcategory scores include counting (0.84), colors (0.97), and spatial position (0.87). On DPG-Bench, Lance scores 84.67 overall, with particularly strong relation modeling, though TUNA (86.76) and TUNA-2 (86.54) lead that benchmark. To put the parameter efficiency in perspective: Janus-Pro-7B scores 0.80 on GenEval and Show-o2 (7B) scores 0.76. Lance matches the top unified model score with 3B activated parameters.
Video Generation. On VBench, Lance achieves a Total Score of 85.11 (using LLM rewriting), the highest among unified models. The next-best unified model, TUNA, scores 84.06. Lance also outscores dedicated generation-only models including HunyuanVideo (83.43) and Wan2.1-T2V (83.69).
Image Editing. On GEdit-Bench, Lance scores 7.30 Avg/G_O, the highest among unified models. It leads in background change, material modification, motion change, portrait beautification, subject removal, subject replacement, and tone transfer. Text modification is flagged as a remaining weakness.
Video Understanding. On MVBench, Lance achieves a 62.0 overall score, the highest among unified models. Show-o2 (7B), the next-best unified model, scores 55.7. Lance also outperforms several understanding-only models with more parameters, which is notable given that it is simultaneously trained for generation and editing.
Key Takeaways
- Lance is a native unified multimodal model with 3B activated parameters that handles image and video understanding, generation, and editing within a single jointly trained framework.
- A dual-stream mixture-of-experts architecture with Modality-Aware Rotary Positional Encoding (MaPE) decouples the understanding and generation pathways while keeping them in a shared interleaved multimodal context.
- Lance achieves 0.90 on GenEval and a VBench Total Score of 85.11, the highest among unified models, while training within a maximum budget of 128 GPUs.
- On MVBench, Lance scores 62.0, the highest among unified models, outperforming Show-o2 (7B) at 55.7 while also supporting generation and editing.
- Lance is open-source under Apache 2.0, with weights available on Hugging Face.
The post One Model, Three Modalities: ByteDance Releases Lance for Image and Video Understanding, Generation, and Editing appeared first on MarkTechPost.




