|

Liquid AI Releases LFM2.5-8B-A1B: An On-Device MoE Model With 8.3B Total and 1.5B Active Parameters

Liquid AI simply shipped LFM2.5-8B-A1B. It is an on-device Mixture-of-Experts (MoE) mannequin constructed for software calling. The mannequin holds 8.3B complete parameters however prompts solely 1.5B per token. That sparsity is what lets it run on shopper {hardware}.

The launch follows LFM2-8B-A1B, which Liquid AI workforce revealed earlier. LFM2.5 is a brand new household of hybrid fashions for on-device deployment. This model provides a 128K context window, reasoning, and scaled-up coaching.

What is LFM2.5-8B-A1B

The mannequin makes use of a sparse MoE design. It prompts 1.5B of 8.3B complete parameters per ahead move. That retains every generated token low-cost to compute.

The structure has 24 layers. Eighteen are double-gated LIV convolution blocks; six are GQA layers. It combines MoE, GQA, and gated brief convolution blocks. The context size is 131,072 tokens. The mannequin covers 9 languages, together with Arabic, Chinese, and Japanese.

Liquid AI workforce recommends a temperature of 0.2, top_k of 80, and repetition_penalty of 1.05.

Unlike its predecessor, LFM2.5-8B-A1B is a reasoning-only mannequin. It produces an specific chain of thought earlier than its ultimate reply. Liquid AI workforce selected this as a result of MoE fashions run in compute-bound settings. A smaller energetic parameter depend makes every reasoning token cheap.

What Changed Since LFM2-8B-A1B

Liquid expanded the context window from 32,768 to 128,000 tokens. Pretraining scaled from 12T to 38T tokens. The vocabulary doubled from 65,536 to 128,000 tokens.

The bigger vocabulary tokenizes non-Latin scripts extra effectively. Liquid AI workforce stories the strongest compression good points in Hindi, Thai, Vietnamese, Indonesian, and Arabic. The remainder of the structure stays the identical as LFM2-8B-A1B.

How Liquid AI Trained It

Liquid AI workforce prolonged the tokenizer in place slightly than retraining from scratch. It continued BPE merge coaching from the unique merges on a multilingual corpus. New embedding rows initialize because the imply of their sub-token decompositions. A short two-stage adaptation then recovers high quality.

Context extension got here in two phases. A 2T token midtraining part reached 32K, centered on reasoning, math, and software use. Raising the RoPE base θ, plus a 400B token stage, reached 128K.

Two reinforcement studying phases goal recognized failure modes. A choice optimization stage reduces ‘doom loops’ in lengthy reasoning traces. It redistributes likelihood mass towards believable options. A separate RL shaping reward discourages loop-inducing restart phrases like ‘Wait…’. Another RL stage makes use of an avg@k-based reward to chop hallucinations. The purpose is abstention on queries past dependable data.

https://www.liquid.ai/weblog/lfm2-5-8b-a1b

The Benchmark Case

LFM2.5-8B-A1B improves over its predecessor throughout the board. The AA-Omniscience Non-Hallucination Rate jumped from 7.46 to 63.47. IFEval rose from 79.44 to 91.84. MATH500 climbed from 74.80 to 88.76. Tau² Telecom rose from 13.60 to 88.07.

Liquid AI workforce in contrast the mannequin towards dense and MoE options. On instruction following, it matches Gemma-4-26B-A4B-IT on IFEval. It does so at a fraction of the energetic parameter depend. On Tau² Telecom, it scores 88.07, forward of a lot bigger fashions.

The avg@ok reward drives a a lot decrease hallucination price. Accuracy stays affordable for the mannequin’s dimension. On agentic benchmarks, it stays aggressive with larger fashions.

Benchmark LFM2-8B-A1B LFM2.5-8B-A1B Δ
AA-Omniscience Non-Hallucination Rate 7.46 63.47 +56.01
IFEval 79.44 91.84 +12.40
MATH500 74.80 88.76 +13.96
Tau² Telecom 13.60 88.07 +74.47

Running It: CPU, GPU, and Tooling

The mannequin ships with day-one help throughout the inference ecosystem. Frameworks embody llama.cpp, MLX, vLLM, and SGLang. ONNX and Liquid’s LEAP edge platform are additionally supported.

On CPU, it decodes 253 tokens/s on an M5 Max. It reaches 146 tokens/s on a Ryzen AI Max+ 395. It stays below 6 GB of reminiscence all through. On a cellphone, it holds about 30 tokens/s.

On a single NVIDIA H100 SXM5, output throughput hits 18.5K tokens per second. That is over 1.6B tokens per day at excessive concurrency.

For software use, LFM2.5 writes Pythonic perform calls by default. They seem between the <|tool_call_start|> and <|tool_call_end|> particular tokens. You can override this to JSON within the system immediate.

Strengths and What to Watch

Strengths:

  • Activates solely 1.5B parameters, conserving inference low-cost on edge {hardware}
  • Competitive instruction-following and agentic scores for its dimension class
  • 128K context window and nine-language protection
  • Open-weight below the LFM1.0 license, with base and post-trained checkpoints

What to Watch:

  • Limited data capability from the small energetic parameter depend
  • Not a match for heavy programming or knowledge-intensive QA with out retrieval
  • Reasoning-only output provides chain-of-thought tokens to each flip
  • Text-only; this variant has no imaginative and prescient or audio enter

Marktechpost’s Visual Explainer

On-Device Model Guide

LFM2.5-8B-A1B

Liquid AI’s on-device Mixture-of-Experts mannequin, constructed for software calling and complicated instruction following on shopper {hardware}.

8.3B complete params
1.5B energetic
128K context
reasoning‑solely
open‑weight
Released May 28, 2026  ·  Liquid AI  ·  LFM1.0 license

What It Is

A sparse MoE that prompts 1.5B of 8.3B parameters per token

  • 24 layers — 18 double-gated LIV convolution blocks plus 6 GQA layers.
  • Combines MoE, GQA, and gated brief convolution blocks.
  • Context size of 131,072 tokens; covers 9 languages.
  • Reasoning-only: produces an specific chain of thought earlier than answering.
  • Recommended params: temperature 0.2, top_k 80, repetition_penalty 1.05.

What Changed Since LFM2-8B-A1B

Bigger context, extra coaching, a wider vocabulary

Context window

32,768 → 128,000

Processes longer paperwork and causes for longer.

Pretraining tokens

12T → 38T

Scaled-up pretraining plus large-scale RL.

Vocabulary dimension

65,536 → 128,000

Tokenizes non-Latin scripts extra effectively.

Best compression good points

5 languages

Hindi, Thai, Vietnamese, Indonesian, Arabic.

How It Was Trained

Tokenizer extension, staged context development, focused RL

  • Tokenizer: prolonged in place, with continued BPE merge coaching on a multilingual corpus.
  • Context: a 2T-token midtraining part to 32K, then RoPE base θ plus 400B tokens to 128K.
  • Doom loops: choice optimization redistributes likelihood mass towards believable options.
  • A separate RL shaping reward discourages loop-inducing restart phrases like “Wait…”.
  • Hallucinations: an avg@k-based RL reward encourages abstention past dependable data.

Benchmarks vs LFM2-8B-A1B

Largest good points in non-hallucination and software use

Benchmark LFM2 LFM2.5 Δ
AA-Omniscience Non-Hallucination Rate 7.46 63.47 +56.01
IFEval 79.44 91.84 +12.40
MATH500 74.80 88.76 +13.96
Tau² Telecom 13.60 88.07 +74.47

On IFEval it matches Gemma-4-26B-A4B-IT at a fraction of the energetic parameter depend.

Inference Performance

Fast on CPU and GPU, with day-one framework help

CPU decode

253 tok/s

M5 Max, below 6 GB reminiscence. 146 tok/s on a Ryzen AI Max+ 395.

On a cellphone

~30 tok/s

Runs regionally and privately on system.

GPU throughput

18.5K tok/s

High concurrency, >1.6B tokens/day on a single H100.

Day-one help

llama.cpp, MLX, vLLM, SGLang.

Also ONNX and Liquid’s LEAP.

Tool Use & Agents

Pythonic perform calls, prepared for on-device brokers

  • By default, writes Pythonic perform calls between <|tool_call_start|> and <|tool_call_end|> tokens.
  • You can override this to JSON perform calls within the system immediate.
  • The LocalCowork demo runs 67 instruments throughout 13 MCP servers.
  • It runs on one laptop computer — no cloud, no API keys, no information leaving the machine.

Run It

Serve in two traces, or load instantly

# Serve with vLLM (OpenAI-compatible API)
pip set up vllm
vllm serve "LiquidAI/LFM2.5-8B-A1B"

# Or load instantly with Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "LiquidAI/LFM2.5-8B-A1B"
mannequin = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", dtype="bfloat16")
tokenizer = AutoTokenizer.from_pretrained(model_id)
Recommended for
Agentic workflows
Tool use
Structured outputs
Multilingual assistants
On-device assistants
Less suited to
Heavy programming
Knowledge-intensive QA with out retrieval


01 / 08

Key Takeaways

  • Liquid AI's LFM2.5-8B-A1B holds 8.3B complete parameters however prompts solely 1.5B per token.
  • It is reasoning-only, with a 128K context window and nine-language protection.
  • Non-Hallucination Rate jumped from 7.46 to 63.47 over LFM2-8B-A1B; IFEval reached 91.84.
  • It decodes 253 tok/s on an M5 Max below 6 GB, and ~30 tok/s on a cellphone.
  • Day-one help spans llama.cpp, MLX, vLLM, and SGLang, with open base and post-trained weights.


Check out the Model Weights and Technical detailsAlso, be happy to observe us on Twitter and don’t overlook to affix our 150k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.

Need to associate with us for selling your GitHub Repo OR Hugging Face Page OR Product Release OR Webinar and many others.? Connect with us

The put up Liquid AI Releases LFM2.5-8B-A1B: An On-Device MoE Model With 8.3B Total and 1.5B Active Parameters appeared first on MarkTechPost.

Similar Posts