Liquid AI Releases LFM2-8B-A1B: An On-Device Mixture-of-Experts with 8.3B Total Params and ~1.5B Active Params per Token

How much capability can a sparse 8.3B-parameter MoE with a ~1.5B active path deliver on your phone without blowing the latency or memory budget? Liquid AI has released LFM2-8B-A1B, a small-scale Mixture-of-Experts (MoE) model built for on-device execution under tight memory, latency, and energy budgets. Unlike most MoE work, which is optimized for cloud batch serving, LFM2-8B-A1B targets phones, laptops, and embedded systems. It carries 8.3B total parameters but activates only ~1.5B parameters per token, using sparse expert routing to keep the compute path small while increasing representational capacity. The model is released under the LFM Open License v1.0 (lfm1.0).
Understanding the Architecture
LFM2-8B-A1B retains the LFM2 "fast backbone" and inserts sparse-MoE feed-forward blocks to add capacity without materially increasing active compute. The backbone uses 18 gated short-convolution blocks and 6 grouped-query attention (GQA) blocks. All layers except the first two include an MoE block; the first two remain dense for stability. Each MoE block defines 32 experts; the router selects the top-4 experts per token with a normalized-sigmoid gate and an adaptive routing bias to balance load and stabilize training. Context length is 32,768 tokens, vocabulary size is 65,536, and the reported pre-training budget is ~12T tokens.
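To make the routing description concrete, here is a minimal sketch of a top-4 normalized-sigmoid router with an adaptive load-balancing bias. It is an illustration of the general technique, not Liquid AI's implementation; the shapes, the bias handling, and the exact normalization are assumptions.

```python
# Sketch of top-k sparse-MoE routing with a sigmoid gate, normalized over the
# selected experts. Illustrative only; not Liquid AI's actual router code.
import torch

def route_tokens(hidden, router_weight, routing_bias, top_k=4):
    """hidden: [tokens, d_model]; router_weight: [num_experts, d_model];
    routing_bias: [num_experts] adaptive bias nudging under-used experts."""
    logits = hidden @ router_weight.T                 # [tokens, num_experts]
    scores = torch.sigmoid(logits)                    # per-expert gate in (0, 1)
    # Assumption: the bias only influences which experts are selected
    # (load balancing), not the mixing weights themselves.
    topk_idx = torch.topk(scores + routing_bias, top_k, dim=-1).indices
    gate = torch.gather(scores, -1, topk_idx)
    gate = gate / gate.sum(dim=-1, keepdim=True)      # normalize selected gates
    return topk_idx, gate                             # expert ids + mixing weights

# Hypothetical sizes: 8 tokens, d_model=1024, 32 experts as in LFM2-8B-A1B.
tokens = torch.randn(8, 1024)
router_w = torch.randn(32, 1024)
bias = torch.zeros(32)
expert_ids, mix_weights = route_tokens(tokens, router_w, bias)
```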
This approach keeps per-token FLOPs and cache growth bounded by the active path (attention plus 4 expert MLPs), while the total capacity allows specialization across domains such as multilingual knowledge, math, and code, use cases that often regress on very small dense models.
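A rough back-of-the-envelope calculation shows how the published totals can fit together. The dense/expert split below is not disclosed by Liquid AI; it is solved for under the simplifying assumption that the entire gap between total and active parameters sits in the expert FFNs.

```python
# Illustrative arithmetic only: the dense/expert split is inferred, not published.
total_params  = 8.3e9          # total parameters (published)
active_params = 1.5e9          # active parameters per token (published)
num_experts   = 32             # experts per MoE block (published)
active_k      = 4              # experts activated per token (published)

# If all "inactive" parameters live in expert FFNs:
#   total  = shared + expert_pool
#   active = shared + (active_k / num_experts) * expert_pool
expert_pool = (total_params - active_params) / (1 - active_k / num_experts)
shared      = total_params - expert_pool
print(f"expert pool ~ {expert_pool/1e9:.1f}B, shared dense path ~ {shared/1e9:.1f}B")
# -> roughly 7.8B in experts and 0.5B shared, consistent with a ~1.5B active path.
```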

Performance indicators
Liquid AI reports that LFM2-8B-A1B runs significantly faster than Qwen3-1.7B in CPU tests using an internal XNNPACK-based stack and a custom CPU MoE kernel. The public plots cover int4 quantization with int8 dynamic activations on an AMD Ryzen AI 9 HX370 and a Samsung Galaxy S24 Ultra. The Liquid AI team positions quality as comparable to 3–4B dense models while keeping active compute near 1.5B parameters. No cross-vendor "×-faster" headline multipliers are published; the claims are framed as per-device comparisons against similarly active models.
On accuracy, the model card lists results across 16 benchmarks, including MMLU/MMLU-Pro/GPQA (knowledge), IFEval/IFBench/Multi-IF (instruction following), GSM8K/GSMPlus/MATH500/MATH-Lvl-5 (math), and MGSM/MMMLU (multilingual). The numbers indicate competitive instruction-following and math performance within the small-model band, and improved knowledge capacity relative to LFM2-2.6B, consistent with the larger total parameter budget.


Deployment and tooling
LFM2-8B-A1B ships with Transformers/vLLM support for GPU inference and GGUF builds for llama.cpp; the official GGUF repo lists common quants from Q4_0 (≈4.7 GB) up to F16 (≈16.7 GB) for local runs, while llama.cpp requires a recent build with lfm2moe support (b6709+) to avoid "unknown model architecture" errors. Liquid's CPU validation uses Q4_0 with int8 dynamic activations on an AMD Ryzen AI 9 HX370 and a Samsung Galaxy S24 Ultra, where LFM2-8B-A1B shows higher decode throughput than Qwen3-1.7B at a comparable active-parameter class; ExecuTorch is referenced for mobile/embedded CPU deployment.
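For the GPU/Transformers path, a minimal sketch is shown below. The repository id "LiquidAI/LFM2-8B-A1B", the dtype, and the chat template usage are assumptions; check the Hugging Face model card for the exact repo name and any minimum transformers version.

```python
# Minimal sketch of GPU inference via Hugging Face Transformers.
# Assumes the repo id "LiquidAI/LFM2-8B-A1B" and that accelerate is installed
# for device_map="auto"; verify both against the model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LiquidAI/LFM2-8B-A1B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="bfloat16", device_map="auto"
)

messages = [{"role": "user", "content": "Explain sparse MoE routing in two sentences."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

For local CPU runs, the GGUF quants load through the standard llama.cpp flow (for example `llama-cli -m <lfm2-8b-a1b-quant>.gguf`, with the filename being whatever quant you downloaded), provided the build is at or past b6709 so the lfm2moe architecture is recognized.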


Key Takeaways
- Architecture & routing: LFM2-8B-A1B pairs an LFM2 fast backbone (18 gated short-conv blocks + 6 GQA blocks) with per-layer sparse-MoE FFNs (all layers except the first two), using 32 experts with top-4 routing via normalized-sigmoid gating and adaptive biases; 8.3B total params, ~1.5B active per token.
- On-device target: Designed for phones, laptops, and embedded CPUs/GPUs; quantized variants "fit comfortably" on high-end consumer hardware for private, low-latency use.
- Performance positioning: Liquid reports that LFM2-8B-A1B is significantly faster than Qwen3-1.7B in CPU tests and aims for 3–4B dense-class quality while keeping an ~1.5B active path.
Editorial Comments
LFM2-8B-A1B demonstrates that sparse MoE can be practical below the usual server-scale regime. The model combines an LFM2 conv-attention backbone with per-layer expert MLPs (except in the first two layers) to keep per-token compute near 1.5B parameters while lifting quality toward the 3–4B dense class. With standard and GGUF weights, llama.cpp/ExecuTorch/vLLM paths, and a permissive on-device posture, LFM2-8B-A1B is a concrete option for building low-latency, private assistants and application-embedded copilots on consumer and edge hardware.
Check out the model on Hugging Face and the technical details.