
Liquid AI Ships LFM2.5-230M with llama.cpp, MLX, vLLM, SGLang, and ONNX Support for On-Device Inference

Liquid AI has shipped LFM2.5-230M, the company's smallest model to date. The launch targets a specific job: running agentic tasks on phones, robots, and automation devices. Both the base and instruction-tuned checkpoints are open-weight on Hugging Face.

The pitch is narrow on purpose. This is not a general reasoning model. It is built for data extraction and tool use on edge hardware.

TL;DR

  • Liquid AI's LFM2.5-230M is its smallest model yet: 230M params, open-weight, built on LFM2.
  • Runs on-device at 213 tok/s on a Galaxy S25 Ultra and 42 tok/s on a Raspberry Pi 5.
  • Beats larger models (Qwen3.5-0.8B, Gemma 3 1B) on instruction following and data extraction.
  • Tuned for tool use and extraction; not for math, code generation, or creative writing.
  • Day-one support across llama.cpp, MLX, vLLM, SGLang, and ONNX, with a 293–375 MB footprint.

What is LFM2.5-230M?

LFM2.5-230M is a 230-million-parameter, text-only model. It is built on the LFM2 architecture. The model has 14 layers in total. Eight are double-gated LIV convolution blocks. The remaining six are grouped-query attention (GQA) blocks. The hybrid layout targets fast CPU inference.

The context length is 32,768 tokens. The vocabulary size is 65,536. The knowledge cutoff is mid-2024. It supports ten languages, including English, Chinese, Arabic, and Japanese.

Liquid AI ships two checkpoints. LFM2.5-230M-Base is the pre-trained model for fine-tuning. LFM2.5-230M is the general-purpose instruction-tuned model. The license is lfm1.0.

Training and Post-Training

The model was pre-trained on 19 trillion tokens. That total includes a 32K context extension phase. The post-training recipe then runs in three stages.

First comes supervised fine-tuning with distillation from the larger LFM2.5-350M. Second is direct preference optimization (DPO). Third is multi-domain reinforcement learning. This preserves flexibility for downstream specialization.

The distillation step is what keeps a 230M model competitive with larger checkpoints. It inherits behavior from the larger LFM2.5-350M on targeted tasks.

Benchmark

Liquid AI evaluated LFM2.5-230M across ten benchmarks. They span knowledge, instruction following, data extraction, and tool use.

The instruction-following results support the narrow pitch. On IFEval, LFM2.5-230M scores 71.71. That beats Qwen3.5-0.8B (59.94) and Gemma 3 1B IT (63.49). On IFBench it scores 38.40, ahead of both. On CaseReportBench, a clinical data-extraction test, it scores 22.51.

Model                     Params  IFEval  IFBench  CaseReportBench  BFCLv4  MMLU-Pro
LFM2.5-230M               230M    71.71   38.40    22.51            21.03   20.25
LFM2.5-350M               350M    76.96   40.69    32.45            21.86   20.01
Granite 4.0-H-350M        350M    61.27   17.22    12.44            13.28   13.14
Qwen3.5-0.8B (Instruct)   800M    59.94   22.87    13.83            18.70   37.42
Gemma 3 1B IT             1B      63.49   20.33    2.28             7.17    14.04

LFM2.5-230M leads the larger third-party models on instruction following and data extraction, though its 350M sibling scores higher on both. It trails on broad knowledge: MMLU-Pro is 20.25, behind Qwen3.5-0.8B's 37.42. It is also weak on some agentic tool use. On τ²-Bench Telecom it scores just 5.26.
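To make the gaps concrete, the short sketch below recomputes the per-benchmark margins between LFM2.5-230M and the two larger third-party models. The scores are copied from the table above; the `margins` helper is a hypothetical name introduced only for this illustration.

```python
# Scores copied from the benchmark table in this article.
SCORES = {
    "LFM2.5-230M": {
        "IFEval": 71.71, "IFBench": 38.40, "CaseReportBench": 22.51,
        "BFCLv4": 21.03, "MMLU-Pro": 20.25,
    },
    "Qwen3.5-0.8B (Instruct)": {
        "IFEval": 59.94, "IFBench": 22.87, "CaseReportBench": 13.83,
        "BFCLv4": 18.70, "MMLU-Pro": 37.42,
    },
    "Gemma 3 1B IT": {
        "IFEval": 63.49, "IFBench": 20.33, "CaseReportBench": 2.28,
        "BFCLv4": 7.17, "MMLU-Pro": 14.04,
    },
}


def margins(baseline, rival):
    """Return baseline-minus-rival score differences, rounded to 2 decimals."""
    return {
        bench: round(SCORES[baseline][bench] - SCORES[rival][bench], 2)
        for bench in SCORES[baseline]
    }


for rival in ("Qwen3.5-0.8B (Instruct)", "Gemma 3 1B IT"):
    # Positive numbers mean LFM2.5-230M scores higher on that benchmark.
    print(rival, margins("LFM2.5-230M", rival))
```

A positive margin on IFEval, IFBench, and CaseReportBench alongside a negative margin on MMLU-Pro against Qwen3.5-0.8B matches the trade-off described above.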

Liquid AI is direct about the limits. It does not recommend the model for reasoning-heavy workloads. That means advanced math, code generation, and creative writing.

Use Cases With Examples

The model fits two jobs well.

  • The first is large-scale data extraction pipelines. Picture a pipeline parsing 100,000 clinical reports into structured fields. A 4-bit build with a 293–375 MB memory footprint runs that on commodity CPUs. You extract locally, with no per-token API bill.
  • The second job is lightweight on-device agentic workloads. Think of a home automation hub that turns speech into tool calls. Or a phone assistant that routes a request to the right function.
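For a rough sense of scale on the extraction job, here is a back-of-envelope estimate of generation time for 100,000 reports. It is only a sketch under loud assumptions: the 200 output tokens per report is a hypothetical figure, the quoted tok/s numbers are treated as sustained generation speed on a single device, and prompt-processing time is ignored entirely.

```python
def decode_hours(num_docs, tokens_per_doc, tokens_per_second):
    """Hours needed to generate the output tokens for a batch of documents."""
    total_tokens = num_docs * tokens_per_doc
    return total_tokens / tokens_per_second / 3600


REPORTS = 100_000          # pipeline size from the example above
TOKENS_PER_REPORT = 200    # hypothetical output length per report

# Throughput figures quoted in this article, in tokens per second.
DEVICE_RATES = {"Galaxy S25 Ultra": 213, "Raspberry Pi 5": 42}

for device, rate in DEVICE_RATES.items():
    hours = decode_hours(REPORTS, TOKENS_PER_REPORT, rate)
    print(f"{device}: about {hours:.1f} hours of generation")
```

Real pipelines would likely shard the work across several machines, so these numbers are better read as per-device costs than as wall-clock predictions.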

As an early signal, Liquid AI deployed the model on a Unitree G1 humanoid robot. It ran entirely on the robot's onboard NVIDIA Jetson Orin. There the model acted as a skill-selection layer. It turned one natural-language instruction into a sequence of tool calls. Those calls invoked low-level skills from NVIDIA's SONIC framework.

Tool Use: How It Works

LFM2.5 supports function calling in four steps. You define tools as JSON in the system prompt. The model writes a Pythonic function call between special tokens. You execute the call and return the result. The model then writes a plain-text answer.
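The first step, defining tools as JSON in the system prompt, might be sketched as follows. The `build_system_prompt` helper is a hypothetical name, and the "List of tools:" prefix and the abbreviated tool schema simply mirror the documented pattern in this article; a full schema would likely carry more fields.

```python
import json


def build_system_prompt(tools):
    """Serialize tool definitions into a system prompt string."""
    return "List of tools: " + json.dumps(tools)


# Abbreviated tool definition, mirroring the example in this article.
tools = [
    {
        "name": "get_candidate_status",
        "parameters": {"candidate_id": {"type": "string"}},
    }
]

# Chat messages ready to pass to a chat template.
messages = [
    {"role": "system", "content": build_system_prompt(tools)},
    {"role": "user", "content": "What is the current status of candidate ID 12345?"},
]

print(messages[0]["content"])
```

Keeping the tool list as plain Python data and serializing it at the last moment makes it easy to add or remove tools per request.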

By default the call is a Python list. It sits between the <|tool_call_start|> and <|tool_call_end|> tokens. Here is the documented pattern, with the tool JSON abbreviated:

<|im_start|>system
List of tools: [{"name": "get_candidate_status",
  "parameters": {"candidate_id": {"type": "string"}}}]<|im_end|>
<|im_start|>user
What is the current status of candidate ID 12345?<|im_end|>
<|im_start|>assistant
<|tool_call_start|>[get_candidate_status(candidate_id="12345")]<|tool_call_end|>Checking the current status of candidate ID 12345.<|im_end|>

You can also force JSON-formatted calls through the system prompt.
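On the application side, the default Pythonic call has to be parsed and executed before its result goes back to the model. One way to do that safely is sketched below, using Python's `ast` module instead of `eval`. The `parse_tool_calls` and `execute` helpers, the `REGISTRY` mapping, and the stubbed `get_candidate_status` return value are all hypothetical; this is not Liquid AI's own parser, and it handles keyword arguments with literal values only.

```python
import ast
import re

# Matches the payload between the two tool-call tokens.
TOOL_CALL_RE = re.compile(
    r"<\|tool_call_start\|>(.*?)<\|tool_call_end\|>", re.DOTALL
)


def parse_tool_calls(text):
    """Return (function_name, kwargs) pairs found in the model output."""
    calls = []
    for payload in TOOL_CALL_RE.findall(text):
        tree = ast.parse(payload.strip(), mode="eval")
        if not isinstance(tree.body, ast.List):
            raise ValueError("expected a Python list of calls")
        for node in tree.body.elts:
            if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)):
                raise ValueError("expected simple function calls")
            if node.args or any(kw.arg is None for kw in node.keywords):
                raise ValueError("only keyword arguments are supported")
            # literal_eval accepts only literals, so no code is executed here.
            kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
            calls.append((node.func.id, kwargs))
    return calls


def get_candidate_status(candidate_id):
    """Hypothetical stub standing in for a real lookup."""
    return {"candidate_id": candidate_id, "status": "unknown"}


# Only functions listed here can be invoked by the model.
REGISTRY = {"get_candidate_status": get_candidate_status}


def execute(text):
    """Run every parsed call against the registry and collect the results."""
    return [REGISTRY[name](**kwargs) for name, kwargs in parse_tool_calls(text)]


sample = (
    '<|tool_call_start|>[get_candidate_status(candidate_id="12345")]'
    "<|tool_call_end|>Checking the current status of candidate ID 12345."
)
print(parse_tool_calls(sample))
print(execute(sample))
```

Parsing into an AST and dispatching through an explicit registry means a malformed or unexpected call raises an error rather than running arbitrary code.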

Running It: A Minimal Example

The model works with Transformers 5.0.0 and up. The recommended generation settings are temperature 0.1, top_k 50, and repetition_penalty 1.05. Note the do_sample=True flag, which is required for these sampling settings to apply.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LiquidAI/LFM2.5-230M"

# Load the instruction-tuned checkpoint in bfloat16.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    dtype="bfloat16",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Apply the chat template and move the tensors to the model's device.
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is C. elegans?"}],
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Sample with the recommended generation settings.
output = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.1,
    top_k=50,
    repetition_penalty=1.05,
    max_new_tokens=512,
)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))

Liquid AI also publishes fine-tuning recipes. They cover SFT, DPO, and GRPO with LoRA, via Unsloth and TRL. Each ships as a Colab notebook.
