Liquid AI Releases LFM2-8B-A1B: An On-Device Mixture-of-Experts with 8.3B Total Params and ~1.5B Active Params per Token

How much capability can a sparse 8.3B-parameter MoE with a ~1.5B active path deliver on your phone without blowing the latency or memory budget? Liquid AI has released LFM2-8B-A1B, a small-scale Mixture-of-Experts (MoE) model built for on-device execution under tight memory, latency, and energy budgets. Unlike most MoE work, which is optimized for cloud batch serving, LFM2-8B-A1B targets phones, laptops, and embedded systems. It carries 8.3B total parameters but activates only ~1.5B parameters per token, using sparse expert routing to keep the compute path small while increasing representational capacity. The model is released under the LFM Open License v1.0 (lfm1.0).
Understanding the Architecture
LFM2-8B-A1B retains the LFM2 "fast backbone" and inserts sparse-MoE feed-forward blocks to add capacity without materially increasing active compute. The backbone uses 18 gated short-convolution blocks and 6 grouped-query attention (GQA) blocks. All layers except the first two include an MoE block; the first two remain dense for stability. Each MoE block defines 32 experts; the router selects the top-4 experts per token with a normalized-sigmoid gate and an adaptive routing bias to balance load and stabilize training. Context length is 32,768 tokens, vocabulary size is 65,536, and the reported pre-training budget is ~12T tokens.
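To make the routing description concrete, here is a minimal sketch of a top-4 normalized-sigmoid router with an adaptive load-balancing bias. It is an illustration of the general technique, not Liquid AI's implementation; the shapes, the bias handling, and the exact normalization are assumptions.

```python
# Sketch of top-k sparse-MoE routing with a sigmoid gate, normalized over the
# selected experts. Illustrative only; not Liquid AI's actual router code.
import torch

def route_tokens(hidden, router_weight, routing_bias, top_k=4):
    """hidden: [tokens, d_model]; router_weight: [num_experts, d_model];
    routing_bias: [num_experts] adaptive bias nudging under-used experts."""
    logits = hidden @ router_weight.T                 # [tokens, num_experts]
    scores = torch.sigmoid(logits)                    # per-expert gate in (0, 1)
    # Assumption: the bias only influences which experts are selected
    # (load balancing), not the mixing weights themselves.
    topk_idx = torch.topk(scores + routing_bias, top_k, dim=-1).indices
    gate = torch.gather(scores, -1, topk_idx)
    gate = gate / gate.sum(dim=-1, keepdim=True)      # normalize selected gates
    return topk_idx, gate                             # expert ids + mixing weights

# Hypothetical sizes: 8 tokens, d_model=1024, 32 experts as in LFM2-8B-A1B.
tokens = torch.randn(8, 1024)
router_w = torch.randn(32, 1024)
bias = torch.zeros(32)
expert_ids, mix_weights = route_tokens(tokens, router_w, bias)
```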
This approach keeps per-token FLOPs and cache growth bounded by the active path (attention plus 4 expert MLPs), while the total capacity allows specialization across domains such as multilingual knowledge, math, and code, use cases that often regress on very small dense models.
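A rough back-of-the-envelope calculation shows how the published totals can fit together. The dense/expert split below is not disclosed by Liquid AI; it is solved for under the simplifying assumption that the entire gap between total and active parameters sits in the expert FFNs.

```python
# Illustrative arithmetic only: the dense/expert split is inferred, not published.
total_params  = 8.3e9          # total parameters (published)
active_params = 1.5e9          # active parameters per token (published)
num_experts   = 32             # experts per MoE block (published)
active_k      = 4              # experts activated per token (published)

# If all "inactive" parameters live in expert FFNs:
#   total  = shared + expert_pool
#   active = shared + (active_k / num_experts) * expert_pool
expert_pool = (total_params - active_params) / (1 - active_k / num_experts)
shared      = total_params - expert_pool
print(f"expert pool ~ {expert_pool/1e9:.1f}B, shared dense path ~ {shared/1e9:.1f}B")
# -> roughly 7.8B in experts and 0.5B shared, consistent with a ~1.5B active path.
```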

Performance indicators
Liquid AI reports that LFM2-8B-A1B runs significantly faster than Qwen3-1.7B in CPU tests using an internal XNNPACK-based stack and a custom CPU MoE kernel. The public plots cover int4 quantization with int8 dynamic activations on an AMD Ryzen AI 9 HX370 and a Samsung Galaxy S24 Ultra. The Liquid AI team positions quality as comparable to 3–4B dense models while keeping active compute near 1.5B parameters. No cross-vendor "×-faster" headline multipliers are published; the claims are framed as per-device comparisons against similarly active models.
On accuracy, the model card lists results across 16 benchmarks, including MMLU/MMLU-Pro/GPQA (knowledge), IFEval/IFBench/Multi-IF (instruction following), GSM8K/GSMPlus/MATH500/MATH-Lvl-5 (math), and MGSM/MMMLU (multilingual). The numbers indicate competitive instruction-following and math performance within the small-model band, and improved knowledge capacity relative to LFM2-2.6B, consistent with the larger total parameter budget.


Deployment and tooling
LFM2-8B-A1B ships with Transformers/vLLM support for GPU inference and GGUF builds for llama.cpp; the official GGUF repo lists common quants from Q4_0 (≈4.7 GB) up to F16 (≈16.7 GB) for local runs, while llama.cpp requires a recent build with lfm2moe support (b6709+) to avoid "unknown model architecture" errors. Liquid's CPU validation uses Q4_0 with int8 dynamic activations on an AMD Ryzen AI 9 HX370 and a Samsung Galaxy S24 Ultra, where LFM2-8B-A1B shows higher decode throughput than Qwen3-1.7B at a comparable active-parameter class; ExecuTorch is referenced for mobile/embedded CPU deployment.
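For the GPU/Transformers path, a minimal sketch is shown below. The repository id "LiquidAI/LFM2-8B-A1B", the dtype, and the chat template usage are assumptions; check the Hugging Face model card for the exact repo name and any minimum transformers version.

```python
# Minimal sketch of GPU inference via Hugging Face Transformers.
# Assumes the repo id "LiquidAI/LFM2-8B-A1B" and that accelerate is installed
# for device_map="auto"; verify both against the model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LiquidAI/LFM2-8B-A1B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="bfloat16", device_map="auto"
)

messages = [{"role": "user", "content": "Explain sparse MoE routing in two sentences."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

For local CPU runs, the GGUF quants load through the standard llama.cpp flow (for example `llama-cli -m <lfm2-8b-a1b-quant>.gguf`, with the filename being whatever quant you downloaded), provided the build is at or past b6709 so the lfm2moe architecture is recognized.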


Key Takeaways
- Architecture & routing: LFM2-8B-A1B pairs an LFM2 fast backbone (18 gated short-conv blocks + 6 GQA blocks) with per-layer sparse-MoE FFNs (all layers except the first two), using 32 experts with top-4 routing via normalized-sigmoid gating and adaptive biases; 8.3B total params, ~1.5B active per token.
- On-device target: Designed for phones, laptops, and embedded CPUs/GPUs; quantized variants "fit comfortably" on high-end consumer hardware for private, low-latency use.
- Performance positioning: Liquid reports that LFM2-8B-A1B is significantly faster than Qwen3-1.7B in CPU tests and aims for 3–4B dense-class quality while keeping an ~1.5B active path.
Editorial Comments
LFM2-8B-A1B demonstrates that sparse MoE can be practical below the usual server-scale regime. The model combines an LFM2 conv-attention backbone with per-layer expert MLPs (except in the first two layers) to keep per-token compute near 1.5B parameters while lifting quality toward the 3–4B dense class. With standard and GGUF weights, llama.cpp/ExecuTorch/vLLM paths, and a permissive on-device posture, LFM2-8B-A1B is a concrete option for building low-latency, private assistants and application-embedded copilots on consumer and edge hardware.
Check out the model on Hugging Face and the technical details.