One Model, Three Modalities: ByteDance Releases Lance for Image and Video Understanding, Generation, and Editing

Building a single model that can both understand and generate images and videos is harder than it sounds. The two tasks pull in opposite directions. Understanding benefits from high-level semantic features tightly aligned with language. Generation needs low-level continuous representations that preserve texture, geometry, and temporal dynamics. Most systems handle this tension by splitting the two into distinct architectures, then bridging them post hoc.

ByteDance's research team took a different approach with Lance. Rather than assembling separate components, the team designed a model that natively integrates understanding, generation, and editing across both image and video modalities, trained jointly from the start.

https://arxiv.org/pdf/2605.18678

What Lance Can Do

Lance organizes its capabilities into three output families: text (X2T), images (X2I), and videos (X2V). On the understanding side, this covers image and video captioning, visual question answering, OCR, visual grounding, and reasoning. On the generation side, it handles text-to-image, text-to-video, image-to-video, subject-driven generation, image editing, and video editing, including multi-turn consistency editing across both modalities.

This breadth is what sets Lance apart. Standard unified architectures often stop at basic image understanding and text-to-image generation; Lance is among the few to natively cover both images and videos across understanding and generation tasks.

How the Architecture Works

The architecture rests on two principles: unified context modeling and decoupled capability pathways.

For unified context, Lance converts all inputs (text, images, and videos) into a single shared interleaved multimodal sequence. Text tokens come from the Qwen2.5-VL embedding layer. For understanding-oriented visual inputs, the Qwen2.5-VL ViT encoder produces compact semantic visual tokens. For generation-oriented visual inputs, the Wan2.2 3D causal VAE encoder encodes images and videos into continuous latent representations, applying 16× spatial downsampling and 4× temporal downsampling. All these heterogeneous token types (text, semantic visual, and latent visual) live in the same sequence. The model then runs generalized 3D causal attention over the full context, with text tokens using causal attention and visual tokens using bidirectional attention.
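The mixed attention pattern can be illustrated with a small mask builder. This is a minimal sketch, not Lance's implementation: the "text"/"visual" labels, the `build_mask` helper, and the rule that a visual token sees everything before it plus the rest of its own contiguous visual block are assumptions chosen to illustrate "causal for text, bidirectional for visual".

```python
from typing import List


def build_mask(kinds: List[str]) -> List[List[bool]]:
    """Return mask[i][j] == True when token i may attend to token j.

    kinds[i] is "text" or "visual" (hypothetical labels for this sketch).
    """
    n = len(kinds)
    # Give every run of consecutive visual tokens its own block id.
    block = [-1] * n
    current = -1
    for i, kind in enumerate(kinds):
        if kind == "visual":
            if i == 0 or kinds[i - 1] != "visual":
                current += 1
            block[i] = current
    # Causal base: every token sees itself and everything before it.
    mask = [[j <= i for j in range(n)] for i in range(n)]
    # Visual tokens may also look ahead inside their own block.
    for i in range(n):
        for j in range(i + 1, n):
            if block[i] != -1 and block[i] == block[j]:
                mask[i][j] = True
    return mask


kinds = ["text", "text", "visual", "visual", "visual", "text"]
mask = build_mask(kinds)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

In this toy version two images placed back to back would merge into one block; a real implementation would need explicit group boundaries.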

For decoupled pathways, Lance uses a dual-stream mixture-of-experts architecture initialized from Qwen2.5-VL 3B. The understanding expert (LLM_UND) handles text and semantic visual tokens, producing outputs for multimodal reasoning and text generation. The generation expert (LLM_GEN) handles VAE latent tokens for visual synthesis and editing. Crucially, both experts operate over the same shared interleaved sequence: they share context but don't compete for the same parameters. The understanding expert is trained with a next-token prediction loss; the generation expert is trained with a flow matching objective in continuous latent space. The two losses are combined with configurable weights throughout training.
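A toy version of the two objectives and their weighted sum is sketched below. The article does not give the exact flow matching formulation or the loss weights, so the straight-line noise-to-data path, the function names, and the example weights are all assumptions for illustration.

```python
import math
from typing import Sequence


def next_token_loss(probs: Sequence[float], target: int) -> float:
    """Cross-entropy at one position: -log p(target token)."""
    return -math.log(probs[target])


def noisy_latent(noise, data, t):
    """Point on the straight path from noise (t=0) to data (t=1). Assumed path."""
    return [(1.0 - t) * n + t * d for n, d in zip(noise, data)]


def flow_matching_loss(noise, data, predicted_velocity):
    """Mean squared error against the constant velocity (data - noise)."""
    target = [d - n for n, d in zip(noise, data)]
    errors = [(p - v) ** 2 for p, v in zip(predicted_velocity, target)]
    return sum(errors) / len(errors)


def combined_loss(und_loss, gen_loss, w_und=1.0, w_gen=1.0):
    """Weighted sum of the two expert losses (weights are hypothetical)."""
    return w_und * und_loss + w_gen * gen_loss


# Understanding expert: the model put probability 0.7 on the correct token.
und = next_token_loss([0.1, 0.7, 0.2], target=1)
# Generation expert: a two-dimensional latent with an imperfect prediction.
gen = flow_matching_loss(noise=[0.0, 1.0], data=[1.0, 3.0],
                         predicted_velocity=[1.0, 1.5])
total = combined_loss(und, gen, w_und=1.0, w_gen=0.5)
print(und, gen, total)
```

In training, the velocity prediction would come from the generation expert evaluated at `noisy_latent(noise, data, t)` for a sampled `t`; here it is passed in directly to keep the sketch self-contained.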

Modality-Aware Rotary Positional Encoding (MaPE)

Running ViT semantic tokens, clean VAE condition tokens, and noisy VAE target tokens through the same sequence creates a subtle problem. Standard 3D-RoPE encodes positions based on spatiotemporal layout alone, so it has no way to tell these token groups apart. When multiple visual token groups occupy the same sequence, their positional boundaries become ambiguous, which can hurt cross-task alignment.

Lance introduces Modality-Aware Rotary Positional Encoding (MaPE) to fix this. MaPE applies a fixed temporal offset to each modality group based on its index in the sequence. Spatial coordinates stay unchanged, so the intrinsic layout within images and videos is preserved. The temporal offset alone is enough to separate the token groups in the global positional space without disrupting temporal ordering within any individual video.
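The effect of the temporal offset can be seen on toy position indices. The sketch below only assigns (t, h, w) coordinates; it does not compute rotary embeddings. The `mape_positions` helper, the `stride` parameter, and the rule "offset = group index × stride" are assumptions, since the article does not state the actual offset values.

```python
from typing import List, Tuple


def mape_positions(groups: List[Tuple[int, int, int]],
                   stride: int) -> List[Tuple[int, int, int]]:
    """Assign a (t, h, w) position to every token of every visual group.

    groups lists (frames, height, width) per group in sequence order.
    Group g has its temporal index shifted by g * stride; h and w are
    left untouched, so the spatial layout inside each group is preserved.
    """
    positions = []
    for index, (frames, height, width) in enumerate(groups):
        shift = index * stride
        for t in range(frames):
            for h in range(height):
                for w in range(width):
                    positions.append((t + shift, h, w))
    return positions


# Hypothetical groups: ViT tokens, clean VAE condition, noisy VAE target.
groups = [(1, 2, 2), (1, 2, 2), (2, 2, 2)]
plain = mape_positions(groups, stride=0)      # layout-only positions
shifted = mape_positions(groups, stride=100)  # with a per-group offset

print("collisions without offset:", len(plain) - len(set(plain)))
print("collisions with offset:", len(shifted) - len(set(shifted)))
```

With `stride=0` the three groups reuse the same coordinates, which is the ambiguity described above; any stride larger than the longest group's frame count keeps them apart while preserving frame order inside each group.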

Removing MaPE drops GenEval from 80.94 to 80.56, GEdit-Bench from 6.86 to 6.30, and VBench from 81.81 to 80.95: a consistent degradation across image generation, image editing, and video generation.

Training: Four Stages, One Unified Framework

Lance is trained through four sequential stages, each building on the last.

Pre-Training (PT) lays the foundation using roughly 1B image-text and 140M video-text pairs, covering 1.5T training tokens. This stage establishes basic multimodal alignment and generation capability. The VAE and ViT encoders are frozen here; only the backbone and connectors are trained.

Continual Training (CT) expands the task space by introducing interleaved multi-task data (editing samples, subject-driven generation samples, and multimodal understanding data) across roughly 300B tokens. A progressive data-mixture schedule gradually increases the proportion of harder tasks like editing as training proceeds.
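One simple way to realize a progressive data-mixture schedule is to interpolate between a starting and an ending mixture. The article does not describe the actual schedule, so the linear form, the task names, and every proportion below are hypothetical.

```python
from typing import Dict


def mixture_at(step: int, total_steps: int,
               start: Dict[str, float], end: Dict[str, float]) -> Dict[str, float]:
    """Linearly interpolate sampling proportions and renormalize to sum to 1."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    raw = {task: (1.0 - frac) * start[task] + frac * end[task] for task in start}
    norm = sum(raw.values())
    return {task: value / norm for task, value in raw.items()}


# Hypothetical proportions: editing grows from 10% to 30% of the batch.
start = {"generation": 0.6, "understanding": 0.3, "editing": 0.1}
end = {"generation": 0.4, "understanding": 0.3, "editing": 0.3}

for step in (0, 5000, 10000):
    print(step, mixture_at(step, 10000, start, end))
```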

Supervised Fine-Tuning (SFT) tightens instruction following, editing accuracy, and identity consistency using curated high-quality data across 72B tokens.

Reinforcement Learning (RL) uses Group Relative Policy Optimization (GRPO), with PaddleOCR serving as the reward model, to further sharpen text rendering accuracy and image-text alignment.
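The "group relative" part of GRPO refers to scoring each sample against the other samples drawn for the same prompt, rather than against a learned value function. A minimal sketch of that normalization step follows; the reward values are made up, and how Lance derives rewards from PaddleOCR output is not described in the article.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(reward - mean) / (std + eps) for reward in rewards]


# Hypothetical rewards for four images generated from one prompt,
# e.g. how well the rendered text matched the requested text.
rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Samples that beat the group average get a positive advantage and are reinforced; those below it get a negative one.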

Everything fits within a maximum training budget of 128 GPUs.

Results

Image Generation. On GenEval, Lance scores 0.90 overall, matching TUNA for the top spot among unified models. Subcategory scores include counting (0.84), colors (0.97), and spatial position (0.87). On DPG-Bench, Lance scores 84.67 overall, with particularly strong relation modeling, though TUNA (86.76) and TUNA-2 (86.54) lead that benchmark. To put the parameter efficiency in perspective: Janus-Pro-7B scores 0.80 on GenEval; Show-o2 (7B) scores 0.76. Lance matches the top unified model score at 3B activated parameters.

Video Generation. On VBench, Lance achieves a Total Score of 85.11 (using LLM rewriting), the highest among unified models. The next-best unified model, TUNA, scores 84.06. Lance also outscores dedicated generation-only models including HunyuanVideo (83.43) and Wan2.1-T2V (83.69).

Image Editing. On GEdit-Bench, Lance scores 7.30 Avg/G_O, the highest among unified models. It leads in background change, material modification, motion change, portrait beautification, subject removal, subject replacement, and tone transfer. Text modification is flagged as a remaining weakness.

Video Understanding. On MVBench, Lance achieves a 62.0 overall score, the highest among unified models. Show-o2 (7B), the next-best unified model, scores 55.7. Lance also outperforms several understanding-only models with more parameters, which is notable given that it is simultaneously trained for generation and editing.

Marktechpost’s Visual Explainer

How-To Guide

Getting Started with Lance by ByteDance

A step-by-step guide to installing and running Lance, a 3B native unified multimodal model for image and video understanding, generation, and editing.

Step 01 — Prerequisites

Check Your Environment First

Before cloning the repository, confirm your system meets the minimum software and hardware requirements. Lance requires CUDA-capable hardware with significant VRAM.

Python: 3.10 or higher (required)
CUDA: 12.4 or higher (required)
GPU VRAM: 40 GB minimum (for inference)
License: Apache 2.0 (open source)

Note: A GPU with at least 40 GB VRAM is required for running inference. CUDA 12.4+ is mandatory; lower versions are not officially supported.
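The requirements above can be checked programmatically before installing anything. The following is a minimal sketch using only the standard library; the helper names are hypothetical, and the CUDA version string and VRAM figure are assumed to come from whatever tool is available on the machine (for example the output of nvidia-smi, or `torch.version.cuda` once PyTorch is installed).

```python
import sys
from typing import Tuple

MIN_PYTHON = (3, 10)
MIN_CUDA = (12, 4)
MIN_VRAM_GB = 40


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn "12.4" into (12, 4) so versions compare numerically."""
    return tuple(int(part) for part in text.strip().split("."))


def python_ok(version=None) -> bool:
    """Check the interpreter (or a supplied version tuple) against 3.10."""
    version = sys.version_info if version is None else version
    return tuple(version[:2]) >= MIN_PYTHON


def cuda_ok(version_text: str) -> bool:
    """Check a CUDA version string such as "12.4" against the minimum."""
    return parse_version(version_text)[:2] >= MIN_CUDA


def vram_ok(total_gb: float) -> bool:
    """Check total GPU memory in GB against the 40 GB inference minimum."""
    return total_gb >= MIN_VRAM_GB


print("Python OK:", python_ok())
```

Comparing tuples rather than strings avoids the classic mistake of treating "12.10" as older than "12.4".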

Step 02 — Clone the Repository

Clone from GitHub

Clone the official Lance repository from ByteDance on GitHub. The repository includes the inference scripts, Gradio interface, benchmark scripts, and model configuration files.

git clone https://github.com/bytedance/Lance
cd Lance

The repository structure you will see after cloning:

inference_lance.py: main inference script for all tasks
inference_lance.sh: shell wrapper with configurable parameters
lance_gradio_t2v_v2t.py: Gradio UI for T2V and V2T tasks
config/examples/: JSON example configs per task type

Step 03 — Install Dependencies

Install Required Packages

Install all Python dependencies from the provided requirements.txt file. It is strongly recommended to use a dedicated virtual environment or conda environment before installing.

# Create and activate a conda environment (recommended)
conda create -n lance-env python=3.10 -y
conda activate lance-env

# Install all dependencies
pip install -r requirements.txt

Tip: Using a clean conda environment prevents dependency conflicts with other projects on the same machine.

Step 04 — Download Model Weights

Download Lance-3B Checkpoints

Download all necessary model checkpoints from the official Hugging Face repository at bytedance-research/Lance. After downloading, place all files in the downloads/ directory inside your cloned repo.

# Install the Hugging Face CLI if not already installed
pip install huggingface_hub

# Download the model weights
huggingface-cli download bytedance-research/Lance \
  --local-dir downloads/

Your directory should look like this after downloading:

Lance/
└── downloads/
    └── Lance_3B_Video/     ◄ model weights go here

Note: Model weights are large files. Ensure you have sufficient disk space and a stable connection before downloading.
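The same download can be done from Python with `snapshot_download` from the `huggingface_hub` package, which is convenient inside setup scripts. This is a sketch under the assumption that the repository layout matches the tree shown in this step; the `download_weights` and `weights_present` helpers are hypothetical names, not part of the Lance repository.

```python
from pathlib import Path


def download_weights(local_dir: str = "downloads/") -> str:
    """Fetch the checkpoint files into local_dir and return the local path."""
    # Imported lazily so weights_present() works without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id="bytedance-research/Lance",
                             local_dir=local_dir)


def weights_present(repo_root: str = ".") -> bool:
    """Return True if downloads/Lance_3B_Video exists and is not empty."""
    model_dir = Path(repo_root) / "downloads" / "Lance_3B_Video"
    return model_dir.is_dir() and any(model_dir.iterdir())


if __name__ == "__main__":
    if not weights_present():
        download_weights()
    print("weights present:", weights_present())
```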

Step 05 — Run Inference

Run Tasks through the CLI

Lance provides a unified command-line interface for all tasks via inference_lance.sh. Configure parameters at the top of the shell script before running. Supported tasks are listed below.

t2i: text-to-image generation
t2v: text-to-video generation
image_edit: image editing from instruction
video_edit: video editing from instruction
x2t_image: image understanding / VQA
x2t_video: video understanding / captioning

Example command for text-to-video generation at 480p:

bash inference_lance.sh \
  --TASK_NAME t2v \
  --MODEL_PATH downloads/Lance_3B_Video \
  --RESOLUTION video_480p \
  --NUM_FRAMES 121 \
  --VIDEO_HEIGHT 480 \
  --VIDEO_WIDTH 848 \
  --SAVE_PATH_GEN results/t2v
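When several runs need to be scripted, the same invocation can be assembled in Python. The sketch below only mirrors the flag names and task names listed in this step; it assumes the shell script accepts them in this form, and the `build_command` and `run` helpers are hypothetical.

```python
import subprocess
from typing import Dict, List

SUPPORTED_TASKS = {"t2i", "t2v", "image_edit", "video_edit",
                   "x2t_image", "x2t_video"}


def build_command(task: str, options: Dict[str, object]) -> List[str]:
    """Build the argument list for inference_lance.sh without running it."""
    if task not in SUPPORTED_TASKS:
        raise ValueError(f"unsupported task: {task}")
    command = ["bash", "inference_lance.sh", "--TASK_NAME", task]
    for name, value in options.items():
        command += [f"--{name}", str(value)]
    return command


def run(command: List[str]) -> None:
    """Execute the command; call this from inside the cloned repository."""
    subprocess.run(command, check=True)


command = build_command("t2v", {
    "MODEL_PATH": "downloads/Lance_3B_Video",
    "RESOLUTION": "video_480p",
    "NUM_FRAMES": 121,
    "VIDEO_HEIGHT": 480,
    "VIDEO_WIDTH": 848,
    "SAVE_PATH_GEN": "results/t2v",
})
print(" ".join(command))
```

Passing a list to `subprocess.run` avoids shell quoting problems, and validating the task name up front fails fast before any GPU time is spent.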

Step 06 — Gradio UI & Tips

Launch the Gradio Interface (Optional)

For a visual interface covering text-to-video and video-to-text tasks, Lance includes a ready-to-run Gradio app.

python lance_gradio_t2v_v2t.py

Prompt Tips

For all tasks, follow the prompt format used in the provided example configs under config/examples/. Using the recommended format generally leads to better generation quality.

x2t_image_example.json: examples for image understanding and VQA
x2t_video_example.json: examples for video understanding and captioning

Customize: You can modify TASK_DEFAULT_CONFIGS in inference_lance.py to set your own default data samples for each task type.

Key Takeaways

  1. Lance is a native unified multimodal model with 3B activated parameters that handles image and video understanding, generation, and editing within a single jointly trained framework.
  2. A dual-stream mixture-of-experts architecture with Modality-Aware Rotary Positional Encoding (MaPE) decouples understanding and generation pathways while keeping them in a shared interleaved multimodal context.
  3. Lance achieves 0.90 on GenEval and 85.11 on VBench, the highest Total Score among unified models, trained within a maximum budget of 128 GPUs.
  4. On MVBench, Lance scores 62.0, the highest among unified models, outperforming Show-o2 (7B) at 55.7 while also supporting generation and editing.
  5. Lance is open source under Apache 2.0, with weights available on Hugging Face.


The post One Model, Three Modalities: ByteDance Releases Lance for Image and Video Understanding, Generation, and Editing appeared first on MarkTechPost.
