Liquid AI Released LFM2-Audio-1.5B: An End-to-End Audio Foundation Model with Sub-100 ms Response Latency

Liquid AI has released LFM2-Audio-1.5B, a compact audio–language foundation model that both understands and generates speech and text through a single end-to-end stack. It is positioned for low-latency, real-time assistants on resource-constrained devices, extending the LFM2 family into audio while retaining a small footprint.

But what’s really new? A unified backbone with disentangled audio I/O
LFM2-Audio extends the 1.2B-parameter LFM2 language backbone to treat audio and text as first-class sequence tokens. Crucially, the model disentangles audio representations: inputs are continuous embeddings projected directly from raw waveform chunks (~80 ms), while outputs are discrete audio codes. This avoids discretization artifacts on the input path while keeping training and generation autoregressive for both modalities on the output path.
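To picture that split, here is a minimal, hypothetical sketch of the two paths. The linear projection, hidden size, and 16 kHz sample rate are assumptions for illustration only; the actual release uses a FastConformer encoder and an RQ-Transformer decoder (see the checkpoint details below).

```python
# Illustrative sketch of disentangled audio I/O (not Liquid AI's implementation):
# continuous embeddings on the input path, discrete codec tokens on the output path.
import torch
import torch.nn as nn

EMBED_DIM = 2048        # hypothetical backbone hidden size
CHUNK_SAMPLES = 1280    # ~80 ms of audio at an assumed 16 kHz sample rate
NUM_CODEBOOKS = 8       # Mimi codec: 8 codebooks per step
CODEBOOK_SIZE = 2049    # audio vocab per codebook, per the model card

# Input path: project raw waveform chunks directly to continuous embeddings,
# so no quantization loss is introduced before the backbone sees the audio.
input_proj = nn.Linear(CHUNK_SAMPLES, EMBED_DIM)

# Output path: predict discrete codec tokens, one logit head per codebook,
# keeping generation autoregressive over both text and audio.
output_heads = nn.ModuleList(
    [nn.Linear(EMBED_DIM, CODEBOOK_SIZE) for _ in range(NUM_CODEBOOKS)]
)

waveform_chunks = torch.randn(1, 10, CHUNK_SAMPLES)   # 10 chunks ~= 0.8 s of audio
audio_embeddings = input_proj(waveform_chunks)        # continuous, fed to the backbone
hidden = torch.randn(1, EMBED_DIM)                    # stand-in for one backbone state
audio_codes = torch.stack(
    [head(hidden).argmax(-1) for head in output_heads], dim=-1
)                                                     # discrete Mimi-style codes
print(audio_embeddings.shape, audio_codes.shape)
```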
On the implementation side, the released checkpoint uses:
- Backbone: LFM2 (hybrid conv + attention), 1.2B params (LM only)
- Audio encoder: FastConformer (~115M, canary-180m-flash)
- Audio decoder: RQ-Transformer predicting discrete Mimi codec tokens (8 codebooks)
- Context: 32,768 tokens; vocab: 65,536 (text) / 2049×8 (audio)
- Precision: bfloat16; license: LFM Open License v1.0; languages: English
Two generation modes for real-time agents
- Interleaved generation for live, speech-to-speech chat, where the model alternates text and audio tokens to minimize perceived latency.
- Sequential generation for ASR/TTS (switching modalities turn by turn).
Liquid AI provides a Python package (liquid-audio) and a Gradio demo to reproduce these behaviors.
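The interleaved mode is the interesting one for perceived latency, so here is a schematic decoding loop that alternates modalities. This is a stand-in toy loop, not the liquid-audio API; `fake_model_step` is a placeholder for one autoregressive decode step.

```python
# Schematic interleaved generation: alternate text and audio tokens so audible
# output can start streaming before the full text response is finished.
from dataclasses import dataclass

@dataclass
class Step:
    kind: str    # "text" or "audio"
    token: int

def fake_model_step(i: int) -> Step:
    """Placeholder for one autoregressive decode step; alternates modalities."""
    return Step("text", i) if i % 2 == 0 else Step("audio", i)

text_tokens, audio_codes = [], []
for i in range(8):                      # a few decode steps for illustration
    step = fake_model_step(i)
    if step.kind == "audio":
        audio_codes.append(step.token)  # would stream to the Mimi decoder / speaker
    else:
        text_tokens.append(step.token)  # would stream to the transcript
print(text_tokens, audio_codes)
```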
Latency: <100 ms to first audio
The Liquid AI team reports end-to-end latency below 100 ms from a 4-second audio query to the first audible response, a proxy for perceived responsiveness in interactive use, and states it is faster than models smaller than 1.5B parameters under their setup.
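For context, time-to-first-audio is straightforward to measure yourself. The sketch below shows one way to do it; `generate_stream` is a hypothetical streaming generator yielding (kind, chunk) pairs, so swap in whatever interface your runtime exposes.

```python
# Minimal sketch of measuring time-to-first-audio, the latency proxy cited above.
import time

def generate_stream(query_audio):
    # Placeholder: pretend the model emits a text token, then an audio chunk.
    yield ("text", "Sure,")
    yield ("audio", b"\x00" * 320)

query = b"\x00" * 64000                  # stand-in for a 4-second audio query
start = time.perf_counter()
for kind, chunk in generate_stream(query):
    if kind == "audio":
        ttfa_ms = (time.perf_counter() - start) * 1000.0
        print(f"time to first audio: {ttfa_ms:.1f} ms")
        break
```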
Benchmarks: VoiceBench and ASR results
On VoiceBench, a suite of nine audio-assistant evaluations, Liquid reports an overall score of 56.78 for LFM2-Audio-1.5B, with per-task numbers disclosed in the blog's chart (e.g., AlpacaEval 3.71, CommonEval 3.49, WildVoice 3.17). The Liquid AI team contrasts this result with larger models such as Qwen2.5-Omni-3B and Moshi-7B in the same table. (VoiceBench is an external benchmark released in late 2024 for LLM-based voice assistants.)
The model card on Hugging Face provides an additional VoiceBench table (with closely related, but not identical, per-task values) and includes classic ASR WERs where LFM2-Audio matches or improves on Whisper-large-v3-turbo on some datasets despite being a generalist speech–text model. For example (lower is better): AMI 15.36 vs. 16.13 (Whisper-large-v3-turbo), LibriSpeech-clean 2.03 vs. 2.10.
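If you want to reproduce such numbers on your own transcripts, word error rate is easy to compute with the jiwer package; the file names below are placeholders, not part of the release.

```python
# Sanity-check sketch for corpus-level WER, assuming reference transcripts and
# model hypotheses are stored one utterance per line (pip install jiwer).
import jiwer

with open("refs.txt") as f, open("hyps.txt") as g:
    refs = [line.strip() for line in f]
    hyps = [line.strip() for line in g]

# jiwer.wer accepts lists of sentences and returns the corpus-level word error rate.
print(f"WER: {100 * jiwer.wer(refs, hyps):.2f}%")
```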

Alright, but why does it actually matter for voice AI?
Most “omni” stacks couple ASR → LLM → TTS, which adds latency and brittle interfaces. LFM2-Audio’s single-backbone design, with continuous input embeddings and discrete output codes, reduces glue logic and enables interleaved decoding for early audio emission. For developers, this translates to simpler pipelines and faster perceived response times, while still supporting ASR, TTS, classification, and conversational agents from one model. Liquid AI provides code, demo entry points, and distribution via Hugging Face.
Check out the GitHub Page, Hugging Face Model Card, and technical details.