
Interfaze Ships diffusion-gemma-asr-small, an Open-Source Diffusion ASR Model Transcribing Six Languages via DiffusionGemma’s Parallel Denoising Decoder


Interfaze, a young YC startup, has open-sourced a new speech recognition model called diffusion-gemma-asr-small. The model transcribes audio through a diffusion decoder, not an autoregressive one. Interfaze describes it as the first multilingual audio diffusion ASR model. One adapter handles six languages. The research team trained only about 42M parameters on top of a frozen 26B backbone. That is roughly 0.16% of the model’s weights.

Two terms matter up front. Autoregressive models generate text one token at a time. Diffusion models refine all tokens in parallel. This model uses the diffusion approach for speech-to-text.

TL;DR

  • The Interfaze team claims it is the first open-source multilingual diffusion ASR model: six languages from a single ~42M-parameter adapter.
  • Transcribes through DiffusionGemma’s diffusion decoder using uniform, random-token diffusion, not the absorbing <mask> scheme.
  • Transcription cost scales with denoising steps, not transcript length.
  • Leads diffusion peers on LibriSpeech (6.6% WER vs Whisfusion’s 8.3%) but trails autoregressive Whisper.
  • The adapter ships under Apache-2.0; DiffusionGemma (Gemma terms) and whisper-small (MIT) load separately.

What is diffusion-gemma-asr-small?

diffusion-gemma-asr-small is an audio-native ASR model. It converts speech to text using a discrete diffusion decoder. That decoder belongs to DiffusionGemma, Google’s 26B mixture-of-experts model. DiffusionGemma activates 4B parameters, using 128 experts with top-8 routing. It generates text by discrete diffusion instead of autoregression.

The kind of diffusion matters here. Most diffusion LLMs use an absorbing <mask> scheme. DiffusionGemma uses uniform, random-token diffusion instead. It fills a fixed-length canvas with random vocabulary tokens. Each step keeps confident predictions and re-randomizes the rest. After a few steps the noise anneals into text.
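The keep-or-re-randomize loop described above can be sketched with a toy example. This is only an illustration under loud assumptions: `toy_denoiser`, `VOCAB`, `TARGET`, and the linear confidence threshold are all hypothetical stand-ins, not DiffusionGemma's actual network or noise schedule. The real model predicts every position from the whole canvas in one parallel pass; the toy simply guesses a fixed target with random confidence.

```python
import random

VOCAB = ["the", "cat", "sat", "on", "a", "mat", "dog", "ran"]
TARGET = ["the", "cat", "sat", "on", "a", "mat"]  # what the toy denoiser "knows"


def toy_denoiser(canvas, rng):
    """Hypothetical stand-in for the network: one (token, confidence) per slot.

    A real model would read the whole canvas bidirectionally; this toy only
    uses its length and guesses the target with a random confidence.
    """
    return [(TARGET[i], rng.random()) for i in range(len(canvas))]


def denoise(length, steps, seed=0):
    rng = random.Random(seed)
    # Start from pure noise: every slot holds a uniformly random vocabulary token.
    canvas = [rng.choice(VOCAB) for _ in range(length)]
    history = [list(canvas)]
    for step in range(1, steps + 1):
        preds = toy_denoiser(canvas, rng)
        # Assumed schedule: the confidence bar falls linearly and reaches 0 at
        # the last step, so the final pass keeps every prediction.
        threshold = 1.0 - step / steps
        canvas = [
            tok if conf >= threshold else rng.choice(VOCAB)  # keep or re-randomize
            for tok, conf in preds
        ]
        history.append(list(canvas))
    return canvas, history


final, history = denoise(length=len(TARGET), steps=8)
for i, state in enumerate(history):
    print(i, " ".join(state))
```

Every slot is updated in each pass, which is why the number of passes, not the canvas length, drives the cost.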

Interfaze added audio to a model that lacked it. Out of the box, DiffusionGemma takes text, images, and video, but not audio. The repo ships only the trained adapter, about 42M parameters. The frozen backbones download separately from their own repos.

How it works

The model does not feed raw waveforms to the LLM. An early attempt tried exactly that and failed. A frozen LLM has never seen a spectrogram. Its embedding space has no notion of formants or phonemes. The model learned to ignore audio and hallucinate fluent nonsense.

The working design uses a frozen whisper-small encoder. It acts only as a feature extractor, not a decoder. Whisper turns 30 seconds of audio into 1500 frames. Each frame holds 768-dimensional acoustic features. A small trainable projector then compresses these frames. It uses conv layers that subsample 8× plus a linear map. The output is 188 “audio tokens” at 2816 dimensions. These tokens scatter into the prompt’s reserved <|audio|> slots. LoRA adapters let the backbone attend to this new modality. The decoder then denoises a 192-token transcript canvas. It runs bidirectionally over roughly 16 steps.

The pipeline, from the model card, is compact:

raw audio ─► whisper-small encoder (frozen) ─► projector (trained, ~19M)
          ─► scatter into <audio> token slots of DiffusionGemma's encoder
          ─► DiffusionGemma decoder denoises a 192-token canvas (bidirectional, cross-attends audio)
          ─► transcript
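The token counts in this pipeline can be checked with a little arithmetic, and the scatter step can be sketched on plain lists. The article only says the projector subsamples 8× with conv layers; the three stride-2 convolutions with kernel 3 and padding 1 below are an assumption chosen because they reproduce the quoted 188 tokens. `scatter_audio` and `AUDIO_SLOT` are hypothetical names for illustration, not the repo's API.

```python
def conv_out_len(n, kernel=3, stride=2, padding=1):
    """Output length of a 1-D convolution (standard formula, dilation 1)."""
    return (n + 2 * padding - kernel) // stride + 1


WHISPER_FRAMES = 1500  # 30 s of audio, from the article

n = WHISPER_FRAMES
lengths = [n]
for _ in range(3):  # assumption: three stride-2 convs give the 8x subsampling
    n = conv_out_len(n)
    lengths.append(n)
print(lengths)  # [1500, 750, 375, 188]

AUDIO_SLOT = -1  # hypothetical id standing in for the reserved audio placeholder


def scatter_audio(prompt_ids, text_embeds, audio_embeds, slot_id=AUDIO_SLOT):
    """Overwrite the embedding at each reserved slot with the next audio token."""
    slots = [i for i, tok in enumerate(prompt_ids) if tok == slot_id]
    if len(slots) != len(audio_embeds):
        raise ValueError("prompt must reserve exactly one slot per audio token")
    out = list(text_embeds)
    for pos, emb in zip(slots, audio_embeds):
        out[pos] = emb
    return out


# Tiny demo: two audio slots between two text tokens, 1-dimensional "embeddings".
prompt_ids = [7, AUDIO_SLOT, AUDIO_SLOT, 9]
text_embeds = [[0.7], [0.0], [0.0], [0.9]]
audio_embeds = [[1.5], [2.5]]
mixed = scatter_audio(prompt_ids, text_embeds, audio_embeds)
print(mixed)  # [[0.7], [1.5], [2.5], [0.9]]
```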

The training unlock

The first training runs stalled. Loss flatlined near 8. The failure was circular. The projector started random, so its output was noise. Attention then learned to ignore it. Almost no gradient reached the projector. The model never learned.

The fix supervised the projector directly. The research team ran the 188 audio tokens through DiffusionGemma’s frozen lm_head. They applied a CTC loss against the transcript. CTC means Connectionist Temporal Classification. It aligns audio features to text without needing attention.

This sidesteps the standoff. The audio embeddings became linearly predictive of the right words. CTC loss then dropped from 24 to 8.6 in 300 steps. On LibriSpeech test-clean, English WER fell 90% → 52% → 14.6% → 6.6% over ten epochs.
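To make the CTC idea concrete, here is a minimal pure-Python sketch of the standard CTC forward algorithm, which sums the probability of every frame-level alignment that collapses to the target. It is a textbook version for illustration, not the team's training code. In the setup described above, the per-frame log-probabilities would come from passing the audio tokens through the frozen lm_head; here they are a hand-made toy, and `ctc_loss` and `logsumexp` are names chosen for this example.

```python
import math

NEG_INF = float("-inf")


def logsumexp(*xs):
    """Numerically stable log(sum(exp(x))) that tolerates -inf inputs."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))


def ctc_loss(log_probs, target, blank=0):
    """Negative log-likelihood of `target` under CTC.

    log_probs: T rows of per-frame log-probabilities over the vocabulary.
    target:    list of label ids, without blanks.
    """
    # Interleave blanks: [blank, y1, blank, y2, ..., yU, blank]
    ext = [blank]
    for label in target:
        ext += [label, blank]
    S, T = len(ext), len(log_probs)

    alpha = [NEG_INF] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, T):
        prev, alpha = alpha, [NEG_INF] * S
        for s in range(S):
            terms = [prev[s]]                      # stay on the same symbol
            if s >= 1:
                terms.append(prev[s - 1])          # advance by one
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(prev[s - 2])          # skip the blank in between
            alpha[s] = logsumexp(*terms) + log_probs[t][ext[s]]

    # Valid paths end on the last label or on the trailing blank.
    total = logsumexp(alpha[-1], alpha[-2]) if S > 1 else alpha[-1]
    return -total


# Two frames, vocabulary {blank, 1}, uniform probabilities, target [1].
# Alignments (1, blank), (blank, 1), (1, 1) each have probability 0.25.
uniform = [[math.log(0.5), math.log(0.5)]] * 2
loss = ctc_loss(uniform, [1])
print(round(loss, 4))  # 0.2877, i.e. -log(0.75)
```

Because the loss depends only on per-frame predictions, gradient reaches the projector even while attention is still ignoring the audio slots.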

Performance and benchmarks

WER means Word Error Rate, where lower is better. CER means Character Error Rate. The model trained on FLEURS, LibriSpeech, and VoxPopuli. All scores below use the Whisper text normalizer at 16 diffusion steps.

benchmark                   | metric | score
LibriSpeech test-clean (en) | WER    | 6.6%
FLEURS English              | WER    | 15.7%
VoxPopuli English           | WER    | 18.5%
FLEURS Hindi                | CER    | 15.8%
FLEURS Mandarin             | CER    | 29.6%
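For readers new to these metrics, WER and CER are both edit distances divided by the reference length, counted over words and characters respectively. The sketch below is a minimal illustration. It does not apply the Whisper text normalizer used for the reported scores, so its numbers are not comparable to the table, and the function names are chosen for this example.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution, or match at zero cost
            ))
        prev = cur
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edits divided by reference word count."""
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hypothesis.split()) / len(ref)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edits over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# One substitution ("sat" -> "sit") plus one deletion ("the"): 2 edits over 6 words.
print(round(wer("the cat sat on the mat", "the cat sit on mat"), 3))  # 0.333
```

CER is the usual choice for languages such as Mandarin, where word boundaries are not marked by spaces.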

Against other diffusion or non-autoregressive ASR, it leads.

model                            | approach                            | LibriSpeech test-clean WER
TransFusion (2022)               | multinomial diffusion               | ~6–7% (proof-of-concept)
Whisfusion (Aug 2025)            | Whisper-large-v3 + masked diffusion | 8.3%
diffusion-gemma-asr-small (2026) | Whisper-small + DiffusionGemma      | 6.6%

Against autoregressive Whisper, it trails. The team attributes this gap to training data, not architecture.

benchmark              | diffusion-gemma-asr-small | Whisper-small | Whisper-large-v3
LibriSpeech test-clean | 6.6%                      | ~3.4%         | ~2.0%
FLEURS-en              | 15.7%                     | ~9–10%        | ~4–5%
VoxPopuli-en           | 18.5%                     | ~9–11%        | ~7–10%

The denoising-step sweep shows a nearly flat curve.

steps | FLEURS-en WER | speed (× real-time)
8     | 15.7%         | 14.9×
16    | 15.6%         | 10.3×
32    | 15.2%         | 6.5×
48    | 15.6%         | 4.7×

Going from 8 to 48 steps buys about 0.1 WER point. It costs roughly 3× the latency. The model converges in about 8 parallel passes. That is around 0.7–1.5s of model time for a 10-second clip.
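The trade-off can be reproduced directly from the sweep. The short sketch below assumes that "N× real-time" means a clip of length L takes L / N seconds of model time; under that reading, the quoted 0.7–1.5s range corresponds to the 8-step through 32-step rows.

```python
# (steps, FLEURS-en WER %, speed as a multiple of real time), copied from the sweep
SWEEP = [(8, 15.7, 14.9), (16, 15.6, 10.3), (32, 15.2, 6.5), (48, 15.6, 4.7)]
CLIP_SECONDS = 10.0


def model_time(clip_seconds, speed_x_realtime):
    """Seconds of model time, assuming speed = clip length / processing time."""
    return clip_seconds / speed_x_realtime


for steps, wer_pct, speed in SWEEP:
    seconds = model_time(CLIP_SECONDS, speed)
    print(f"{steps:>2} steps: WER {wer_pct:.1f}%, about {seconds:.2f}s per 10s clip")

wer_gain = round(SWEEP[0][1] - SWEEP[-1][1], 1)  # WER points gained from 8 to 48 steps
slowdown = SWEEP[0][2] / SWEEP[-1][2]            # latency ratio, 48 steps vs 8 steps
print(f"WER gain: {wer_gain} points, slowdown: {slowdown:.1f}x")
```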

Use cases with examples

  • Batch transcription pipelines benefit from parallel decoding. Cost is set by denoising steps, not clip length. A ten-second clip needs roughly the same number of passes as a shorter one.
  • Multilingual transcription runs from a single adapter. It covers English, German, French, Spanish, Hindi, and Mandarin. Teams avoid loading a separate model per language.
  • Non-autoregressive ASR research gains a reproducible baseline. The recipe grounds a frozen LLM with a small adapter. Researchers can extend it with more audio or a larger encoder.
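As a rough sketch of the batch use case, the helper below loops over files with a fixed step budget. `transcribe_many`, `read_audio`, and `transcribe_one` are hypothetical names for this example, not part of the repo. The two callables are injected so the loop can be exercised with stubs, and the comment shows one way they might be bound to the `transcribe` call from the repo's inference.py.

```python
def transcribe_many(paths, read_audio, transcribe_one, max_steps=8):
    """Transcribe each file with the same denoising-step budget.

    read_audio:     callable(path) -> waveform (e.g. 16 kHz mono samples)
    transcribe_one: callable(waveform, max_steps=...) -> transcript string
    """
    results = {}
    for path in paths:
        wav = read_audio(path)
        results[path] = transcribe_one(wav, max_steps=max_steps)
    return results


# With the real model loaded, the callables might be bound like this (assumption):
#   read_audio     = lambda p: sf.read(p)[0]
#   transcribe_one = lambda wav, max_steps: transcribe(wav, model, tok, fe, max_steps=max_steps)

# Stubbed demo so the sketch runs without the 26B backbone.
fake_audio = {"a.wav": [0.0] * 4, "b.wav": [0.0] * 8}
calls = []


def stub_transcribe(wav, max_steps):
    calls.append(max_steps)
    return f"{len(wav)} samples"


out = transcribe_many(fake_audio, fake_audio.__getitem__, stub_transcribe)
print(out)  # {'a.wav': '4 samples', 'b.wav': '8 samples'}
```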

How to get started

The model lives on the Hub. It ships the adapter, model.py, audio.py, and a runnable inference.py. DiffusionGemma support requires installing transformers from the main branch.

pip install torch peft soundfile librosa huggingface_hub \
  "transformers @ git+https://github.com/huggingface/transformers.git"

Then transcribe in Python:

import sys, soundfile as sf
from huggingface_hub import snapshot_download

repo = snapshot_download("interfaze-ai/diffusion-gemma-asr-small")   # adapter, ~170 MB
sys.path.insert(0, repo)
from inference import load, transcribe

# Loads frozen DiffusionGemma-26B + whisper-small + this adapter.
model, tok, fe = load(f"{repo}/diffusion_asr_small.pt", device="cuda")

wav, sr = sf.read("audio.wav")   # expects 16 kHz mono float32
print(transcribe(wav, model, tok, fe, max_steps=16))

A command-line path also works from inside the downloaded repo:

python inference.py audio.wav

The max_steps argument trades speed for accuracy. The team notes 8 is near-best and fastest. The default is 16. The base models load under their own licenses: DiffusionGemma under Gemma terms, whisper-small under MIT.
