Liquid AI Ships LFM2.5-230M with llama.cpp, MLX, vLLM, SGLang, and ONNX Support for On-Device Inference
Liquid AI has shipped LFM2.5-230M, the company’s smallest model to date. The launch targets a specific job: running agentic tasks on phones, robots, and automation devices. Both the base and instruction-tuned checkpoints are open-weight on Hugging Face.
The pitch is narrow on purpose. This is not a general reasoning model. It is built for data extraction and tool use on edge hardware.
TL;DR
- Liquid AI’s LFM2.5-230M is its smallest model yet: 230M params, open-weight, built on LFM2.
- Runs on-device at 213 tok/s on a Galaxy S25 Ultra and 42 tok/s on a Raspberry Pi 5.
- Beats larger models (Qwen3.5-0.8B, Gemma 3 1B) on instruction following and data extraction.
- Tuned for tool use and extraction; not for math, code generation, or creative writing.
- Day-one support across llama.cpp, MLX, vLLM, SGLang, and ONNX, with a 293–375 MB footprint.
What is LFM2.5-230M?
LFM2.5-230M is a 230-million-parameter, text-only model. It is built on the LFM2 architecture. The model has 14 layers in total. Eight are double-gated LIV convolution blocks. The remaining six are grouped-query attention (GQA) blocks. The hybrid layout targets fast CPU inference.
The context length is 32,768 tokens. The vocabulary size is 65,536. The knowledge cutoff is mid-2024. It supports ten languages, including English, Chinese, Arabic, and Japanese.
The Liquid AI team ships two checkpoints. LFM2.5-230M-Base is the pre-trained model for fine-tuning. LFM2.5-230M is the general-purpose instruction-tuned model. The license is lfm1.0.
Training and Post-Training
The model was pre-trained on 19 trillion tokens. That total includes a 32K context extension phase. The post-training recipe then runs in three stages.
First comes supervised fine-tuning with distillation from the larger LFM2.5-350M. Second is direct preference optimization (DPO). Third is multi-domain reinforcement learning. This preserves flexibility for downstream specialization.
The distillation step is what keeps a 230M model competitive with larger checkpoints. It inherits behavior from the larger LFM2.5-350M on targeted tasks.
Benchmark
The Liquid AI team evaluated LFM2.5-230M across ten benchmarks. They span knowledge, instruction following, data extraction, and tool use.
The instruction-following results support the narrow positioning. On IFEval, LFM2.5-230M scores 71.71. That beats Qwen3.5-0.8B (59.94) and Gemma 3 1B IT (63.49). On IFBench it scores 38.40, ahead of both. On CaseReportBench, a clinical data-extraction test, it scores 22.51.
| Model | Params | IFEval | IFBench | CaseReportBench | BFCLv4 | MMLU-Pro |
|---|---|---|---|---|---|---|
| LFM2.5-230M | 230M | 71.71 | 38.40 | 22.51 | 21.03 | 20.25 |
| LFM2.5-350M | 350M | 76.96 | 40.69 | 32.45 | 21.86 | 20.01 |
| Granite 4.0-H-350M | 350M | 61.27 | 17.22 | 12.44 | 13.28 | 13.14 |
| Qwen3.5-0.8B (Instruct) | 800M | 59.94 | 22.87 | 13.83 | 18.70 | 37.42 |
| Gemma 3 1B IT | 1B | 63.49 | 20.33 | 2.28 | 7.17 | 14.04 |
LFM2.5-230M beats every non-Liquid model in the table on instruction following and data extraction; only its larger sibling, LFM2.5-350M, scores higher. It trails on broad knowledge: MMLU-Pro is 20.25, behind Qwen3.5-0.8B’s 37.42. It is also weak on some agentic tool use. On τ²-Bench Telecom it scores just 5.26.
Liquid AI is direct about the limits. It does not recommend the model for reasoning-heavy workloads. That means advanced math, code generation, and creative writing.
Use Cases With Examples
The model fits two jobs well.
- The first is large-scale data extraction pipelines. Picture a pipeline parsing 100,000 clinical reports into structured fields. A 4-bit build with a 293–375 MB memory footprint runs that on commodity CPUs. You extract locally, with no per-token API bill.
- The second job is lightweight on-device agentic workloads. Think of a home automation hub that turns speech into tool calls. Or a phone assistant that routes a request to the right function.
As an early signal, Liquid AI deployed the model on a Unitree G1 humanoid robot. It ran entirely on the robot’s onboard NVIDIA Jetson Orin. There the model acted as a skill-selection layer. It turned one natural-language instruction into a sequence of tool calls. Those calls invoked low-level skills from NVIDIA’s SONIC framework.
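For the extraction pipeline in the first use case, a back-of-envelope time estimate helps with capacity planning. The sketch below treats the quoted on-device speeds (213 tok/s and 42 tok/s) as decode throughput, which is an assumption. The 150 output tokens per report is a hypothetical figure, and prompt processing time is ignored, so real runs will take longer.

```python
def estimate_hours(num_docs, output_tokens_per_doc, tokens_per_second):
    """Rough decode-only wall-clock estimate for one device, in hours."""
    total_tokens = num_docs * output_tokens_per_doc
    return total_tokens / tokens_per_second / 3600


NUM_REPORTS = 100_000            # pipeline size from the use case above
OUTPUT_TOKENS_PER_REPORT = 150   # hypothetical: size of one structured record

# Throughput figures quoted for the model; assumed here to be decode speed.
DEVICE_TOKENS_PER_SECOND = {
    "Galaxy S25 Ultra": 213,
    "Raspberry Pi 5": 42,
}

for device, tps in DEVICE_TOKENS_PER_SECOND.items():
    hours = estimate_hours(NUM_REPORTS, OUTPUT_TOKENS_PER_REPORT, tps)
    print(f"{device}: about {hours:.1f} hours on a single device")
```

Because each report is independent, the work splits cleanly across devices: dividing the estimate by the number of workers gives a first approximation of the parallel run time.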
Tool Use: How It Works
LFM2.5 supports function calling in four steps. You define tools as JSON in the system prompt. The model writes a Pythonic function call between special tokens. You execute the call and return the result. The model then writes a plain-text answer.
By default the call is a Python list. It sits between the <|tool_call_start|> and <|tool_call_end|> tokens. Here is the documented pattern, with the tool JSON abbreviated:
<|im_start|>system
List of tools: [{"name": "get_candidate_status",
"parameters": {"candidate_id": {"type": "string"}}}]<|im_end|>
<|im_start|>user
What is the current status of candidate ID 12345?<|im_end|>
<|im_start|>assistant
<|tool_call_start|>[get_candidate_status(candidate_id="12345")]<|tool_call_end|>Checking the current status of candidate ID 12345.<|im_end|>
You can also force JSON-formatted calls through the system prompt.
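The first three steps can be sketched on the host side in plain Python. This is a minimal illustration, not Liquid AI's reference code. The helper names (build_system_prompt, parse_tool_calls, run_tool_calls), the registry, and the stub tool are all hypothetical. It assumes the default Pythonic format with keyword arguments only, and it parses the call with the standard ast module rather than eval, so model output is never executed directly.

```python
import ast
import json

TOOL_CALL_START = "<|tool_call_start|>"
TOOL_CALL_END = "<|tool_call_end|>"


def build_system_prompt(tools):
    """Step 1: embed the tool definitions as JSON in the system prompt."""
    return "List of tools: " + json.dumps(tools)


def parse_tool_calls(assistant_text):
    """Step 2: turn the Pythonic call list into (name, kwargs) pairs."""
    start = assistant_text.find(TOOL_CALL_START)
    end = assistant_text.find(TOOL_CALL_END)
    if start == -1 or end == -1 or end < start:
        return []  # this turn contains no tool call
    payload = assistant_text[start + len(TOOL_CALL_START):end].strip()
    tree = ast.parse(payload, mode="eval")
    if not isinstance(tree.body, ast.List):
        raise ValueError("expected a Python list of calls")
    calls = []
    for node in tree.body.elts:
        if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
            raise ValueError("expected plain function calls")
        if node.args or any(kw.arg is None for kw in node.keywords):
            raise ValueError("this sketch supports keyword arguments only")
        # literal_eval accepts constants and containers, never arbitrary code.
        kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
        calls.append((node.func.id, kwargs))
    return calls


def run_tool_calls(calls, registry):
    """Step 3: execute each call against an allow-list of functions."""
    results = []
    for name, kwargs in calls:
        if name not in registry:
            raise KeyError(f"unknown tool: {name}")
        results.append({"name": name, "result": registry[name](**kwargs)})
    return results


def get_candidate_status(candidate_id):
    # Hypothetical stand-in for a real lookup.
    return {"candidate_id": candidate_id, "status": "Interview Scheduled"}


reply = (
    '<|tool_call_start|>[get_candidate_status(candidate_id="12345")]'
    "<|tool_call_end|>Checking the current status of candidate ID 12345."
)
calls = parse_tool_calls(reply)
results = run_tool_calls(calls, {"get_candidate_status": get_candidate_status})
print(json.dumps(results))
```

The serialized results are what you would hand back to the model for its final answer. The exact format of that tool-result turn is not shown in this article, so the sketch stops before it.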
Running It: A Minimal Example
The model works with Transformers 5.0.0 and up. The recommended generation settings are temperature 0.1, top_k 50, and repetition_penalty 1.05. Note the do_sample=True flag, which is required for these sampling settings to apply.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LiquidAI/LFM2.5-230M"

# Load the instruction-tuned checkpoint in bfloat16 on the available device.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    dtype="bfloat16",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Render the chat template and tokenize in one call.
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is C. elegans?"}],
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# do_sample=True is needed for temperature and top_k to take effect.
output = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.1,
    top_k=50,
    repetition_penalty=1.05,
    max_new_tokens=512,
)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
Liquid AI also publishes fine-tuning recipes. They cover SFT, DPO, and GRPO with LoRA, via Unsloth and TRL. Each ships as a Colab notebook.
The post Liquid AI Ships LFM2.5-230M with llama.cpp, MLX, vLLM, SGLang, and ONNX Support for On-Device Inference appeared first on MarkTechPost.
