Zyphra Releases Zamba2-VL: Hybrid Mamba2–Transformer Vision-Language Models That Cut Time-to-First-Token by About an Order of Magnitude

Zyphra has released Zamba2-VL, a family of open vision-language models. The release covers three sizes: 1.2B, 2.7B, and 7B parameters. Each model is built on the Zamba2 hybrid SSM–Transformer backbone.

Vision-language models (VLMs) read images and text together. They answer questions about charts, documents, and photos. Most open VLMs use a dense Transformer as the language model. Zamba2-VL replaces that with a hybrid state-space design. The goal is competitive accuracy at lower latency.

What is Zamba2-VL

Zamba2-VL follows the now-standard LLaVA-style VLM template. A pre-trained vision encoder turns image patches into features. A lightweight MLP adapter projects these features into the language model's embedding space. The language model then reads an interleaved sequence of vision and text tokens. The models support single-image and multi-image understanding and grounding.

Zyphra pairs each Zamba2 backbone with the Vision Transformer from Qwen2.5-VL. That encoder was chosen for two specific properties: it uses 2D rotary position embeddings and native dynamic-resolution processing. A two-layer MLP adapter connects the encoder to the backbone.
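
The encoder, adapter, and language-model data flow described above can be sketched in plain Python. This is a toy illustration, not Zyphra's implementation: the dimensions, the random weights, the GELU activation, and the helper names (`linear`, `mlp_adapter`, `build_sequence`) are all assumptions chosen for readability.

```python
import math
import random

random.seed(0)

VISION_DIM, LM_DIM = 4, 6  # toy sizes; real models use thousands of dimensions


def random_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def linear(x, weight, bias):
    # weight holds one row per output feature
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def gelu(v):
    # tanh approximation of GELU (the activation choice is an assumption)
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))


W1, B1 = random_matrix(LM_DIM, VISION_DIM), [0.0] * LM_DIM
W2, B2 = random_matrix(LM_DIM, LM_DIM), [0.0] * LM_DIM


def mlp_adapter(vision_feature):
    """Two-layer MLP: vision feature space -> language model embedding space."""
    hidden = [gelu(v) for v in linear(vision_feature, W1, B1)]
    return linear(hidden, W2, B2)


def build_sequence(segments):
    """Interleave text embeddings and projected vision features in prompt order."""
    sequence = []
    for kind, vectors in segments:
        for vec in vectors:
            sequence.append((kind, mlp_adapter(vec) if kind == "vision" else vec))
    return sequence


# Two text tokens, three image patches, one more text token (all values are dummies).
segments = [
    ("text", [[0.0] * LM_DIM] * 2),
    ("vision", [[1.0] * VISION_DIM] * 3),
    ("text", [[0.0] * LM_DIM]),
]
sequence = build_sequence(segments)
print([kind for kind, _ in sequence])
```

Every element of `sequence` has length `LM_DIM`, which is the point of the adapter: once projected, vision features and text embeddings can share one input sequence.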

https://www.zyphra.com/our-work/zamba2-vl

The Architecture

The Zamba2 backbone is where the design diverges from typical VLMs. It is a hybrid of Mamba2 state-space layers and shared transformer blocks. The Mamba2 layers run in linear time with a fixed-size state. A small number of shared attention layers are interleaved between them. Each shared block carries a unique LoRA adapter at each layer.

The Mamba2 layers carry the bulk of the computation cheaply. The shared attention layers preserve the in-context retrieval that pure-SSM models give up. The hybrid trades full-attention expressivity against state-space efficiency.

Zamba2-VL uses the Mistral v0.1 tokenizer. It was trained on 100B tokens of vision-text and pure-text data. That data was sourced from open web datasets.
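
To make the interleaving concrete, here is a minimal sketch of how such a layer stack could be laid out. The layer counts, the spacing between attention calls, the number of shared blocks, and the function name `build_layer_plan` are hypothetical; the article does not state the real schedule. The sketch only shows the idea that shared weights are reused while each invocation gets its own LoRA adapter.

```python
def build_layer_plan(n_mamba_layers, attn_every, n_shared_blocks=1):
    """Return an ordered list of (layer_type, detail) tuples for a toy hybrid stack.

    A shared attention block is invoked after every `attn_every` Mamba2 layers.
    The block's weights are reused, but each invocation gets its own LoRA index.
    """
    plan = []
    invocation = 0
    for i in range(1, n_mamba_layers + 1):
        plan.append(("mamba2", f"layer_{i}"))
        if i % attn_every == 0:
            shared_id = invocation % n_shared_blocks
            plan.append(("shared_attention", f"block_{shared_id}+lora_{invocation}"))
            invocation += 1
    return plan


# Hypothetical sizes: 12 Mamba2 layers, attention every 3 layers, 2 shared blocks.
plan = build_layer_plan(n_mamba_layers=12, attn_every=3, n_shared_blocks=2)
for layer_type, detail in plan:
    print(f"{layer_type:17s} {detail}")

n_attention_calls = sum(1 for layer_type, _ in plan if layer_type == "shared_attention")
print("attention invocations:", n_attention_calls)
```

In this toy layout, four attention invocations reuse only two sets of attention weights, while four small LoRA adapters let each invocation specialize.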

Model Quality and Benchmarks

The research team evaluated Zamba2-VL across 14 benchmarks. These span chart, diagram, and document understanding. They also cover general perception, reasoning, and visual counting. All scores come from Zyphra's evaluation harness, which is based on VLMEvalKit. The report compares against the Molmo2, Qwen3-VL, and InternVL3.5 families.

Eval	Zamba2-VL-2.7B	InternVL3.5-2B	Qwen3-VL-2B	Molmo2-4B	Qwen3-VL-4B
DocVQA (test)	90.9	89.4	93.3	87.8	95.3
ChartQA (test)	79.6	81.6	78.7	86.1	81.8
OCRBench	73.6	83.4	84.1	62.0	84.1
CountBenchQA	87.5	70.0	87.9	91.2	87.3
PixMoCount (test)	82.5	32.8	55.7	87.0	89.2
MMMU (val)	37.7	49.9	40.9	48.8	51.4
MathVista (mini)	51.0	61.4	51.8	56.5	63.6

InternVL3.5-2B and Qwen3-VL-2B are comparable in size. Molmo2-4B and Qwen3-VL-4B are larger.

The pattern is uneven and worth understanding. Counting is the strongest category. Zyphra reports Zamba2-VL-1.2B at 62.5 on PixMoCount. That compares with 32.8 for InternVL3.5-1B and 17.7 for PerceptionLM-1B. Document understanding also holds up, with DocVQA at 90.9 for the 2.7B model. The model lags larger baselines on knowledge-heavy reasoning, such as MMMU and MathVista, and the same-size baselines in the table also score higher there.
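
The unevenness is easier to see with a little arithmetic on the table above. The sketch below copies the published scores and computes, for each benchmark, the rank of Zamba2-VL-2.7B among the five models and its gap to the better of the two similarly sized baselines. The helper name `summarize` and the choice of summary statistics are illustrative, not part of Zyphra's report.

```python
MODELS = ["Zamba2-VL-2.7B", "InternVL3.5-2B", "Qwen3-VL-2B", "Molmo2-4B", "Qwen3-VL-4B"]
SAME_SIZE = ["InternVL3.5-2B", "Qwen3-VL-2B"]

# Scores copied from the table, in the same column order as MODELS.
SCORES = {
    "DocVQA (test)": [90.9, 89.4, 93.3, 87.8, 95.3],
    "ChartQA (test)": [79.6, 81.6, 78.7, 86.1, 81.8],
    "OCRBench": [73.6, 83.4, 84.1, 62.0, 84.1],
    "CountBenchQA": [87.5, 70.0, 87.9, 91.2, 87.3],
    "PixMoCount (test)": [82.5, 32.8, 55.7, 87.0, 89.2],
    "MMMU (val)": [37.7, 49.9, 40.9, 48.8, 51.4],
    "MathVista (mini)": [51.0, 61.4, 51.8, 56.5, 63.6],
}


def summarize(scores):
    """Map each eval to (rank of Zamba2-VL-2.7B, gap to the best same-size baseline)."""
    rows = {}
    for eval_name, values in scores.items():
        by_model = dict(zip(MODELS, values))
        ours = by_model["Zamba2-VL-2.7B"]
        rank = 1 + sum(1 for v in values if v > ours)
        best_same_size = max(by_model[m] for m in SAME_SIZE)
        rows[eval_name] = (rank, round(ours - best_same_size, 1))
    return rows


summary = summarize(SCORES)
for name, (rank, gap) in summary.items():
    print(f"{name:18s} rank {rank}/5  gap vs best 2B baseline: {gap:+.1f}")
```

On these numbers the largest positive gap is on PixMoCount and the largest negative gaps are on MMMU, OCRBench, and MathVista, which matches the qualitative summary.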

Why Inference is Faster

Inference is where Zamba2-VL shows its main advantage. Transformer attention scales quadratically with sequence length. Multimodal inputs make sequences long very quickly. A single high-resolution image can add several thousand vision tokens. A short video clip can produce tens of thousands of tokens.
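
A quick back-of-the-envelope calculation shows why sequence length matters so much. The sketch below compares how a quadratic term and a linear term grow relative to a 3,400-token baseline, roughly one high-resolution image. It models only the asymptotic scaling of the attention-score computation versus a linear-time layer, not measured latency, and the function name `relative_cost` is made up for this example.

```python
def relative_cost(n_tokens, baseline_tokens):
    """Return (quadratic, linear) cost growth relative to a baseline length."""
    ratio = n_tokens / baseline_tokens
    return ratio ** 2, ratio


for n in (3_400, 8_192, 32_768):
    quad, lin = relative_cost(n, baseline_tokens=3_400)
    print(f"{n:>6} tokens: quadratic term x{quad:6.1f}, linear term x{lin:5.1f}")
```

Real models also spend time in components that are linear in sequence length, such as MLP layers, so end-to-end speedups are smaller than the raw quadratic-versus-linear ratio suggests.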

Zamba2-VL's Mamba2 layers avoid the growing KV cache of attention. The model inherits near-linear-time prefill and a fixed-size recurrent state. On a 32k-token prefill, it leads on the score-versus-TTFT (time-to-first-token) plot. No Transformer VLM in the comparison matched its score at comparable latency. The latency gap is about an order of magnitude.

The efficiency advantage is largest at the 1.2B and 2.7B scales. That is the range targeted for on-device and edge deployment.
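
The memory side of this argument can be illustrated with simple arithmetic. The sketch below compares a KV cache, which grows with sequence length, against a recurrent SSM state, which does not. All configuration numbers are hypothetical and chosen to keep the arithmetic readable; they are not the Zamba2-VL or baseline configurations. The sketch assumes batch size 1 and bf16 storage, and it ignores the small convolution state of Mamba2 layers as well as the KV cache that a hybrid still keeps for its few attention blocks.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Keys and values for every attention layer: grows linearly with seq_len."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def ssm_state_bytes(n_layers, d_inner, d_state, bytes_per_elem=2):
    """Recurrent SSM state per layer (d_inner x d_state): independent of seq_len."""
    return n_layers * d_inner * d_state * bytes_per_elem


# Hypothetical configurations, not taken from any released model.
transformer = dict(n_layers=32, n_kv_heads=8, head_dim=128)
ssm = dict(n_layers=32, d_inner=4096, d_state=64)

for seq_len in (3_400, 32_768):
    kv = kv_cache_bytes(seq_len, **transformer)
    state = ssm_state_bytes(**ssm)
    print(f"{seq_len:>6} tokens: KV cache {kv / 2**20:8.1f} MiB, SSM state {state / 2**20:6.1f} MiB")
```

With these toy numbers the KV cache reaches 4 GiB at 32,768 tokens while the recurrent state stays at 16 MiB regardless of context length.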
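
Time-to-first-token is the wall-clock delay between submitting a prompt and receiving the first generated token, so it is dominated by prefill. A rough way to measure it for any model is to time a generation call capped at one new token. The harness below is a generic sketch, not Zyphra's benchmarking code; `measure_ttft` and `fake_first_token` are hypothetical names, and the stand-in workload only exists so the example runs without a GPU.

```python
import statistics
import time


def measure_ttft(generate_first_token, warmup=1, runs=5):
    """Median wall-clock seconds for `generate_first_token()` to return."""
    for _ in range(warmup):
        generate_first_token()  # warm-up runs absorb one-time setup costs
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        generate_first_token()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Stand-in workload so the sketch runs anywhere; replace with a real model call.
def fake_first_token():
    time.sleep(0.01)


print(f"median TTFT: {measure_ttft(fake_first_token) * 1000:.1f} ms")
```

With a Hugging Face model, the callable would typically wrap `model.generate(**inputs, max_new_tokens=1)` followed by `torch.cuda.synchronize()`, so that the timer waits for the GPU work to finish.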

Use Cases With Examples

The practical question is where this fits. Document and form extraction benefits from the strong DocVQA results. Think invoice parsing or receipt digitization at scale. Retail and inventory counting maps to the PixMoCount and CountBenchQA strengths. Grounding support enables pointing to objects in product or UI images. On-device assistants benefit from the low time-to-first-token. The 1.2B model targets phones and edge boxes. Long visual inputs, like multi-page PDFs, gain the most from linear-time prefill.

Getting Started

The three models live in the Zyphra Zamba2-VL collection on Hugging Face. Inference runs through Zyphra's transformers fork, based on transformers v4.57.1. The optimized Mamba2 kernels need a CUDA GPU for good latency.

Install the fork and its core dependencies:

pip install "transformers @ git+https://github.com/Zyphra/transformers.git@zamba2-vl"
pip install qwen-vl-utils==0.0.2
pip install flash_attn

Optimized Mamba2 kernels want two extra packages:

pip install --no-build-isolation "causal-conv1d @ git+https://github.com/Zyphra/z-causal-conv1d.git@zamba2-vl"
pip install --no-build-isolation "mamba-ssm @ git+https://github.com/Zyphra/mamba.git@zamba2-vl"
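
Before loading a model, it can help to confirm that the packages above actually landed in the active environment. The following check is a small convenience sketch, not part of Zyphra's instructions; it assumes the import names `transformers`, `qwen_vl_utils`, `flash_attn`, `causal_conv1d`, and `mamba_ssm`, and it only tests that each module can be located, not that its CUDA kernels work.

```python
import importlib.util

REQUIRED_MODULES = ["transformers", "qwen_vl_utils", "flash_attn", "causal_conv1d", "mamba_ssm"]


def check_modules(names):
    """Map each module name to True if it can be found on the import path."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


for name, found in check_modules(REQUIRED_MODULES).items():
    print(f"{name:15s} {'ok' if found else 'MISSING'}")
```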

Then load the model and run a single-image query:

from transformers import Zamba2_VLForConditionalGeneration, Zamba2_VLProcessor
import torch
from PIL import Image
from qwen_vl_utils import process_vision_info
import requests

device = "cuda"
processor = Zamba2_VLProcessor.from_pretrained("Zyphra/Zamba2-VL-2.7B", temporal_patch_size=1)
model = Zamba2_VLForConditionalGeneration.from_pretrained(
    "Zyphra/Zamba2-VL-2.7B",
    device_map=device,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# Download a sample image and define the question to ask about it.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
question = "What do you see in the image? Give us some detail."
num_img_tokens = 3400  # upper bound on vision tokens for this image

# One user turn containing the image (with its pixel budget) and the text prompt.
conversation = [
    {"role": "user", "content": [
        {"type": "image", "image": image,
         "max_pixels": num_img_tokens * 28 * 28, "min_pixels": 10 * 28 * 28},
        {"type": "text", "text": question},
    ]},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
images, _ = process_vision_info(conversation)
inputs = processor(text=prompt, images=images, add_special_tokens=True, return_tensors="pt")
inputs = {key: value.to(device) for key, value in inputs.items()}

# Generate, then decode only the newly generated tokens (skip the prompt).
outputs = model.generate(**inputs, max_new_tokens=100)
print(processor.tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Swap the model ID for Zamba2-VL-1.2B or Zamba2-VL-7B to change scale.
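
The `max_pixels` and `min_pixels` values in the snippet are expressed as multiples of 28 * 28, which suggests that each vision token covers roughly a 28×28-pixel area. Under that assumption, the number of vision tokens for an image can be estimated from its resolution. The function below is an approximation modeled on the resizing logic used by Qwen-style preprocessing as commonly described; `estimate_vision_tokens` is a hypothetical helper, and the true count from the processor may differ, for example because of special tokens.

```python
import math

PIXELS_PER_TOKEN_SIDE = 28  # assumed from the 28 * 28 factor in the snippet above


def estimate_vision_tokens(height, width, min_pixels, max_pixels, factor=PIXELS_PER_TOKEN_SIDE):
    """Approximate the vision token count after resizing to fit the pixel budget."""
    # Snap both sides to multiples of `factor`.
    h = max(factor, round(height / factor) * factor)
    w = max(factor, round(width / factor) * factor)
    if h * w > max_pixels:
        # Too large: shrink while keeping the aspect ratio, rounding down.
        scale = math.sqrt(height * width / max_pixels)
        h = math.floor(height / scale / factor) * factor
        w = math.floor(width / scale / factor) * factor
    elif h * w < min_pixels:
        # Too small: enlarge while keeping the aspect ratio, rounding up.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / factor) * factor
        w = math.ceil(width * scale / factor) * factor
    return (h // factor) * (w // factor)


budget = dict(min_pixels=10 * 28 * 28, max_pixels=3400 * 28 * 28)
print(estimate_vision_tokens(480, 640, **budget))    # small image, well under the cap
print(estimate_vision_tokens(3000, 4000, **budget))  # large image, scaled down to the cap
```

Lowering `num_img_tokens` in the original snippet shrinks this budget, trading visual detail for a shorter prefill.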

Strengths and Weaknesses

Strengths:

First open VLM family on a fully open hybrid SSM–Transformer LLM, per Zyphra.
About an order of magnitude lower time-to-first-token than comparable Transformer baselines.
Strong visual counting and competitive document understanding.
Three sizes cover edge, mid, and 7B-class deployment.
Apache 2.0 license with public weights and working inference code.

Weaknesses and Challenges:

Released as a research artifact.
Lags larger models on knowledge reasoning like MMMU and MathVista.
Lower OCRBench than same-size Qwen3-VL and InternVL3.5.
Optimized kernels need a CUDA GPU; CPU paths are slow.
Deployment requires self-hosting from the released code.

Key Takeaways

Zamba2-VL ships at 1.2B, 2.7B, and 7B parameters under Apache 2.0.
The backbone pairs Mamba2 state-space layers with a few shared transformer blocks.
Time-to-first-token drops about an order of magnitude versus comparable Transformer VLMs.
Counting and document understanding are strengths; knowledge reasoning lags.
Weights and working inference code are public on Hugging Face and GitHub.

Published by Marktechpost: AI/ML research, model releases, and developer tutorials for engineers and data scientists.
