
Liquid AI’s LFM2-VL-3B Brings a 3B Parameter Vision Language Model (VLM) to Edge-Class Devices

Liquid AI launched LFM2-VL-3B, a 3B-parameter vision language model for image-text-to-text tasks. It extends the LFM2-VL family beyond the 450M and 1.6B variants. The model targets higher accuracy while preserving the speed profile of the LFM2 architecture. It is available on LEAP and Hugging Face under the LFM Open License v1.0.

Model overview and interface

LFM2-VL-3B accepts interleaved image and text inputs and produces text outputs. The model exposes a ChatML-like template, and the processor inserts an <image> sentinel that is replaced with encoded image tokens at run time. The default text context length is 32,768 tokens. These details help developers reproduce evaluations and integrate the model with existing multimodal pipelines.

https://www.liquid.ai/blog/lfm2-vl-3b-a-new-efficient-vision-language-for-the-edge

Architecture

The stack pairs a language tower with a shape-aware vision tower and a projector. The language tower is LFM2-2.6B, a hybrid convolution-plus-attention backbone. The vision tower is SigLIP2 NaFlex at 400M parameters; it preserves native aspect ratios and avoids distortion. The connector is a 2-layer MLP with pixel unshuffle, which compresses image tokens before fusion into the language space. This design lets users cap vision token budgets without retraining the model.
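To make the connector's role concrete, here is a minimal, hypothetical sketch of a pixel-unshuffle-plus-MLP projector in PyTorch. The hidden sizes and unshuffle factor are illustrative assumptions, not the released checkpoint's exact configuration; the point is how folding each 2×2 block of vision tokens into the channel dimension cuts the token count by 4× before projection into the language space.

```python
import torch
import torch.nn as nn

# Hypothetical connector sketch: pixel unshuffle (factor 2) merges each 2x2 block of
# vision tokens into one, then a 2-layer MLP projects into the language model width.
# Dimensions are illustrative assumptions, not the released checkpoint's exact sizes.
class Connector(nn.Module):
    def __init__(self, vision_dim=1152, text_dim=2048, factor=2):
        super().__init__()
        self.factor = factor
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim * factor * factor, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, x, h, w):
        # x: (batch, h*w, vision_dim) grid of image tokens from the encoder
        b, _, d = x.shape
        f = self.factor
        x = x.view(b, h, w, d)
        # Pixel unshuffle: fold each f x f spatial block into the channel dimension.
        x = x.view(b, h // f, f, w // f, f, d)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (h // f) * (w // f), d * f * f)
        return self.mlp(x)  # 4x fewer tokens, projected to the language space

# Example: a 512x512 tile with 16x16 patches yields a 32x32 token grid (1024 tokens);
# the connector folds it to 256 tokens in the language model's hidden size.
out = Connector()(torch.randn(1, 32 * 32, 1152), h=32, w=32)
print(out.shape)  # torch.Size([1, 256, 2048])
```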

The encoder processes native resolutions up to 512×512. Larger inputs are split into non-overlapping 512×512 patches, and a thumbnail pathway provides global context across tiles. The efficient token mapping is documented with concrete examples: a 256×384 image maps to 96 tokens, and a 1000×3000 image maps to 1,020 tokens. The model card exposes user controls for minimum and maximum image tokens and the tiling switch. These controls tune speed and quality at inference time.
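As a rough sanity check, the documented native-resolution figure is consistent with 16×16 vision patches followed by a 4× pixel-unshuffle reduction; both values are assumptions for illustration, and the tiled case is not modeled here because it also depends on resizing and the thumbnail pathway.

```python
import math

def native_image_tokens(width, height, patch=16, unshuffle_factor=2):
    """Rough estimate of image tokens for an input processed at native resolution (<= 512x512).

    Assumes 16x16 vision patches and a 4x pixel-unshuffle reduction; these are
    illustrative assumptions, not the released implementation.
    """
    patches = math.ceil(width / patch) * math.ceil(height / patch)
    return math.ceil(patches / unshuffle_factor**2)

print(native_image_tokens(256, 384))  # -> 96, matching the documented example
```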

Inference settings

The Hugging Face model card provides recommended parameters. Text generation uses temperature 0.1, min_p 0.15, and a repetition penalty of 1.05. Vision settings use a minimum of 64 image tokens, a maximum of 256 image tokens, and image splitting enabled. The processor applies the chat template and the image sentinel automatically. The example uses AutoModelForImageTextToText and AutoProcessor with bfloat16 precision.
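A minimal sketch of that pipeline is below, assuming a recent transformers release with LFM2-VL support; the repo id LiquidAI/LFM2-VL-3B and the image URL are assumptions, and the model card's own snippet should be preferred where they differ.

```python
# Sketch: load LFM2-VL-3B via Transformers and run one image + text query.
# Repo id and image URL are assumptions; see the model card for the exact snippet.
import torch
import requests
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "LiquidAI/LFM2-VL-3B"  # assumed Hugging Face repo id
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open(requests.get("https://example.com/sample.jpg", stream=True).raw)
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# The processor renders the ChatML-like template and inserts the <image> sentinel.
# Vision-side controls (min/max image tokens, image splitting) are processor options;
# check the model card for the exact keyword names.
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Recommended decoding settings from the model card.
output = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.1,
    min_p=0.15,
    repetition_penalty=1.05,
)
print(processor.batch_decode(output, skip_special_tokens=True)[0])
```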

How is it trained?

Liquid AI describes a staged approach. The team performs joint mid-training that adjusts the text-to-image ratio over time. The model then undergoes supervised fine-tuning focused on image understanding. The data sources are large-scale open datasets plus in-house synthetic vision data for task coverage.

Benchmarks

The research team reports competitive results among lightweight open VLMs. On MM-IFEval the model reaches 51.83. On RealWorldQA it reaches 71.37. On MMBench dev en it reaches 79.81. The POPE score is 89.01. The table notes that scores for other systems were computed with VLMEvalKit. The table excludes Qwen3-VL-2B because that system was released one day earlier.

https://www.liquid.ai/blog/lfm2-vl-3b-a-new-efficient-vision-language-for-the-edge

The language capability stays close to the LFM2-2.6B backbone. The research team cites 30% on GPQA and 63% on MMLU. This matters when perception tasks include knowledge queries. The team also reports expanded multilingual visual understanding across English, Japanese, French, Spanish, German, Italian, Portuguese, Arabic, Chinese, and Korean.

Why should edge users care?

The architecture keeps compute and memory within small device budgets. Image tokens are compressible and user-constrained, so throughput is predictable. The SigLIP2 400M NaFlex encoder preserves aspect ratios, which helps fine-grained perception. The projector reduces tokens at the connector, which improves tokens per second. The team also published a GGUF build for on-device runtimes. These properties are useful for robotics, mobile, and industrial clients that need local processing and strict data boundaries.

Key Takeaways

  1. Compact multimodal stack: The 3B-parameter LFM2-VL-3B pairs an LFM2-2.6B language tower with a 400M SigLIP2 NaFlex vision encoder and a 2-layer MLP projector for image-token fusion. NaFlex preserves native aspect ratios.
  2. Resolution handling and token budgets: Images run natively up to 512×512; larger inputs tile into non-overlapping 512×512 patches with a thumbnail pathway for global context. Documented token mappings include 256×384 → 96 tokens and 1000×3000 → 1,020 tokens.
  3. Inference interface: ChatML-like prompting with an <image> sentinel, a default text context of 32,768 tokens, recommended decoding settings, and processor-level controls for image splitting enable reproducible evaluation and straightforward integration into multimodal pipelines.
  4. Measured performance: Reported results include MM-IFEval 51.83, RealWorldQA 71.37, MMBench-dev-en 79.81, and POPE 89.01. Language-only signals from the backbone are about 30% GPQA and 63% MMLU, useful for mixed perception-plus-knowledge workloads.

Editorial Comments

LFM2-VL-3B is a practical step for edge multimodal workloads. The 3B stack pairs LFM2-2.6B with a 400M SigLIP2 NaFlex encoder and an efficient projector that lowers image token counts for predictable latency. Native-resolution processing with 512×512 tiling and token caps gives deterministic budgets. Reported scores on MM-IFEval, RealWorldQA, MMBench, and POPE are competitive for this size. Open weights, a GGUF build, and LEAP access reduce integration friction. Overall, this is an edge-ready VLM release with clear controls and transparent benchmarks.


Check out the Model on Hugging Face and the technical details.

