Interfaze Ships diffusion-gemma-asr-small, an Open-Source Diffusion ASR Model Transcribing Six Languages via DiffusionGemma’s Parallel Denoising Decoder
Interfaze, a young YC startup, has open-sourced a new speech recognition model. It is called diffusion-gemma-asr-small. The model transcribes audio through a diffusion decoder, not an autoregressive one. It is described as the first multilingual audio diffusion ASR model. One adapter handles six languages. The research team trained only about 42M parameters on top of a frozen 26B backbone. That is roughly 0.16% of the model’s weights.
Two terms matter up front. Autoregressive models generate text one token at a time. Diffusion models refine all tokens in parallel. This model uses the diffusion approach for speech-to-text.
TL;DR
- Claimed by the Interfaze team to be the first open-source multilingual diffusion ASR: six languages from a single ~42M-parameter adapter.
- Transcribes via DiffusionGemma’s diffusion decoder using uniform, random-token diffusion, not the absorbing <mask> scheme.
- Transcription cost scales with denoising steps, not transcript length.
- Leads diffusion peers on LibriSpeech (6.6% WER vs Whisfusion’s 8.3%) but trails autoregressive Whisper.
- The adapter ships under Apache-2.0; DiffusionGemma (Gemma terms) and whisper-small (MIT) load separately.
What is diffusion-gemma-asr-small?
diffusion-gemma-asr-small is an audio-native ASR model. It converts speech to text using a discrete diffusion decoder. That decoder belongs to DiffusionGemma, Google’s 26B mixture-of-experts model. DiffusionGemma activates 4B parameters, using 128 experts with top-8 routing. It generates text by discrete diffusion instead of autoregression.
The diffusion detail is specific. Most diffusion LLMs use an absorbing <mask> scheme. DiffusionGemma uses uniform, random-token diffusion instead. It fills a fixed-length canvas with random vocabulary tokens. Each step keeps confident predictions and re-randomizes the rest. After a few steps the noise anneals into text.
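The keep-or-re-randomize loop described above can be sketched with a toy denoiser. This is only an illustration under stated assumptions, not DiffusionGemma's sampler: `toy_predict` is a hypothetical stand-in for the model, and the vocabulary, confidence rule, and threshold are invented for the example.

```python
import random

VOCAB = ["the", "cat", "sat", "on", "mat", "dog", "ran", "a"]  # toy vocabulary (assumption)
TARGET = ["the", "cat", "sat", "on", "the", "mat"]            # text the toy "model" converges to


def toy_predict(canvas, step, total_steps, rng):
    """Hypothetical denoiser: returns one (token, confidence) guess per position.

    Confidence grows with the step index, so more positions are kept over time.
    """
    preds = []
    for i in range(len(canvas)):
        confidence = min(1.0, rng.random() * 0.5 + (step + 1) / total_steps)
        preds.append((TARGET[i], confidence))
    return preds


def denoise(length, total_steps=8, threshold=0.9, seed=0):
    rng = random.Random(seed)
    # Start from pure noise: every slot holds a random vocabulary token.
    canvas = [rng.choice(VOCAB) for _ in range(length)]
    for step in range(total_steps):
        preds = toy_predict(canvas, step, total_steps, rng)
        last = step == total_steps - 1
        for i, (token, confidence) in enumerate(preds):
            if last or confidence >= threshold:
                canvas[i] = token              # keep the confident prediction
            else:
                canvas[i] = rng.choice(VOCAB)  # re-randomize the rest
    return canvas


print(" ".join(denoise(len(TARGET))))
```

The number of calls to the denoiser equals `total_steps`, whatever the canvas length, which is the property the article's cost claims rest on.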
Interfaze added audio input to a model that lacked it. Out of the box, DiffusionGemma takes text, images, and video. It doesn’t take audio. The repo ships only the trained adapter, about 42M parameters. The frozen backbones download separately from their own repos.
How it works
The model does not feed raw waveforms to the LLM. An early attempt did exactly that and failed. A frozen LLM has never seen a spectrogram. Its embedding space has no notion of formants or phonemes. The model learned to ignore the audio and hallucinate fluent nonsense.
The working design uses a frozen whisper-small encoder. It acts only as a feature extractor, not a decoder. Whisper turns 30 seconds of audio into 1500 frames. Each frame holds 768-dimensional acoustic features. A small trainable projector then compresses these frames. It uses conv layers that subsample 8× plus a linear map. The output is 188 “audio tokens” at 2816 dimensions. These tokens scatter into the prompt’s reserved <|audio|> slots. LoRA adapters let the backbone attend to this new modality. The decoder then denoises a 192-token transcript canvas. It runs bidirectionally over roughly 16 steps.
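The 1500 → 188 frame count can be checked with plain shape arithmetic. The sketch below assumes the 8× subsampling comes from three stride-2 convolutions with kernel 3 and padding 1, and that the channel width stays at 768 until the final linear map. Both are assumptions for illustration; the article only states "conv layers that subsample 8× plus a linear map".

```python
def conv1d_out_len(length, kernel=3, stride=2, padding=1):
    """Output length of a 1-D convolution (standard formula, dilation 1)."""
    return (length + 2 * padding - kernel) // stride + 1


def projector_shapes(frames=1500, in_dim=768, out_dim=2816, num_convs=3):
    """Trace (sequence length, feature dim) through a hypothetical projector."""
    shapes = [(frames, in_dim)]
    length = frames
    for _ in range(num_convs):
        length = conv1d_out_len(length)
        shapes.append((length, in_dim))
    shapes.append((length, out_dim))  # final linear map to the LLM embedding width
    return shapes


print(projector_shapes())
```

Under these assumptions the lengths go 1500 → 750 → 375 → 188, which matches the 188 audio tokens quoted above (1500 / 8 = 187.5, rounded up by the padding).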
The pipeline, from the model card, is compact:
raw audio ─► whisper-small encoder (frozen) ─► projector (trained, ~19M)
─► scatter into <audio> token slots of DiffusionGemma's encoder
─► DiffusionGemma decoder denoises a 192-token canvas (bidirectional, cross-attends audio)
─► transcript
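The "scatter into audio token slots" step in the pipeline can be pictured as overwriting placeholder embeddings with projected audio vectors. This is a minimal sketch using plain lists; `AUDIO_ID` and `scatter_audio` are hypothetical names, and the real implementation presumably operates on tensors.

```python
AUDIO_ID = -1  # hypothetical id of the reserved audio placeholder token


def scatter_audio(token_ids, text_embeds, audio_embeds, audio_id=AUDIO_ID):
    """Replace the embedding at each placeholder position with the next audio vector."""
    slots = [i for i, t in enumerate(token_ids) if t == audio_id]
    if len(slots) != len(audio_embeds):
        raise ValueError(f"{len(slots)} slots but {len(audio_embeds)} audio tokens")
    merged = list(text_embeds)  # shallow copy; text rows are left untouched
    for slot, vec in zip(slots, audio_embeds):
        merged[slot] = vec
    return merged


ids = [5, AUDIO_ID, AUDIO_ID, 9]
text = [[0.5], [0.0], [0.0], [0.9]]
audio = [[1.0], [2.0]]
print(scatter_audio(ids, text, audio))  # [[0.5], [1.0], [2.0], [0.9]]
```

The length check matters in practice: the prompt has to reserve exactly as many slots as the projector emits (188 in the article's setup).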
The training unlock
The first training runs stalled. Loss flatlined near 8. The failure was circular. The projector started random, so its output was noise. Attention then learned to ignore it. Almost no gradient reached the projector. The model never learned.
The fix supervised the projector directly. The research team ran the 188 audio tokens through DiffusionGemma’s frozen lm_head. They applied a CTC loss against the transcript. CTC means Connectionist Temporal Classification. It aligns audio features to text without needing attention.
This sidesteps the standoff. The audio embeddings became linearly predictive of the right words. CTC loss then dropped from 24 to 8.6 in 300 steps. On LibriSpeech test-clean, English WER fell 90% → 52% → 14.6% → 6.6% over ten epochs.
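To make the CTC objective concrete, here is a small pure-Python version of the CTC forward algorithm for a single utterance. It is a teaching sketch, not the team's training code: it works in probability space rather than log space, and the choice of `blank=0` is an assumption (the article does not say which token serves as blank). In a PyTorch training loop one would more likely reach for `torch.nn.CTCLoss`.

```python
import math


def ctc_neg_log_likelihood(probs, target, blank=0):
    """CTC loss for one utterance via the forward algorithm.

    probs:  list of T per-frame probability lists over the vocabulary.
    target: list of label ids (no blanks).
    """
    # Interleave blanks: [a, b] -> [blank, a, blank, b, blank].
    ext = [blank]
    for label in target:
        ext += [label, blank]
    S, T = len(ext), len(probs)

    # alpha[s] = total probability of all frame paths that end at ext[s].
    alpha = [0.0] * S
    alpha[0] = probs[0][ext[0]]
    if S > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, T):
        new = [0.0] * S
        for s in range(S):
            total = alpha[s]                   # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]          # advance by one position
            # Skip over a blank, allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][ext[s]]
        alpha = new

    # Valid paths end on the last label or on the trailing blank.
    likelihood = alpha[-1] + (alpha[-2] if S > 1 else 0.0)
    return -math.log(likelihood)


# Two frames, vocabulary {0: blank, 1: "a"}, target "a".
frames = [[0.4, 0.6], [0.3, 0.7]]
print(ctc_neg_log_likelihood(frames, [1]))  # equals -log(0.88)
```

In the two-frame example the valid alignments are "aa", "a-", and "-a", with total probability 0.42 + 0.18 + 0.28 = 0.88. Because the loss sums over alignments per frame, gradient reaches the audio features even when the backbone's attention is ignoring them.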
Performance and benchmarks
WER means Word Error Rate, where lower is better. CER means Character Error Rate. The model was trained on FLEURS, LibriSpeech, and VoxPopuli. All scores below use the Whisper text normalizer at 16 diffusion steps.
| benchmark | metric | score |
|---|---|---|
| LibriSpeech test-clean (en) | WER | 6.6% |
| FLEURS English | WER | 15.7% |
| VoxPopuli English | WER | 18.5% |
| FLEURS Hindi | CER | 15.8% |
| FLEURS Mandarin | CER | 29.6% |
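For readers new to the metrics, WER and CER are both edit distance divided by reference length, over words and characters respectively. The sketch below uses plain whitespace splitting and no text normalization, so it would not reproduce the table's numbers exactly (those use the Whisper text normalizer); the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# One substitution (sat -> sit) and one deletion (the) over six reference words.
print(wer("the cat sat on the mat", "the cat sit on mat"))
```

CER is the usual choice for languages such as Mandarin, where word boundaries are not marked by spaces, which is why the Hindi and Mandarin rows report CER.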
Against other diffusion or non-autoregressive ASR, it leads.
| model | approach | LibriSpeech test-clean WER |
|---|---|---|
| TransFusion (2022) | multinomial diffusion | ~6–7% (proof-of-concept) |
| Whisfusion (Aug 2025) | Whisper-large-v3 + masked diffusion | 8.3% |
| diffusion-gemma-asr-small (2026) | Whisper-small + DiffusionGemma | 6.6% |
Against autoregressive Whisper, it trails. The team frames this gap as data, not architecture.
| benchmark | ours | Whisper-small | Whisper-large-v3 |
|---|---|---|---|
| LibriSpeech clean | 6.6% | ~3.4% | ~2.0% |
| FLEURS-en | 15.7% | ~9–10% | ~4–5% |
| VoxPopuli-en | 18.5% | ~9–11% | ~7–10% |
The denoising-step sweep shows a nearly flat curve.
| steps | FLEURS-en WER | speed |
|---|---|---|
| 8 | 15.7% | 14.9× real-time |
| 16 | 15.6% | 10.3× |
| 32 | 15.2% | 6.5× |
| 48 | 15.6% | 4.7× |
Going from 8 to 48 steps buys about 0.1 WER point. It costs roughly 3× the latency. The model converges in about 8 parallel passes. That is around 0.7–1.5 s of model time for a 10-second clip.
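The latency figures follow from the real-time factors in the sweep table: processing time is clip length divided by the speed multiple. A quick check, with the table values copied in by hand:

```python
# (steps, real-time factor) pairs copied from the sweep table above.
SWEEP = [(8, 14.9), (16, 10.3), (32, 6.5), (48, 4.7)]


def model_seconds(clip_seconds, realtime_factor):
    """Processing time for a clip that runs at `realtime_factor` times real time."""
    return clip_seconds / realtime_factor


for steps, rtf in SWEEP:
    print(f"{steps:>2} steps: {model_seconds(10.0, rtf):.2f} s for a 10 s clip")

# Ratio of the fastest to the slowest setting.
slowdown = SWEEP[0][1] / SWEEP[-1][1]
print(f"8 -> 48 steps is {slowdown:.1f}x slower")
```

This gives about 0.67 s at 8 steps and 1.54 s at 32 steps for a 10-second clip, and a slowdown of about 3.2× between 8 and 48 steps.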
Use cases with examples
- Batch transcription pipelines benefit from parallel decoding. Cost is set by denoising steps, not clip length. A ten-second clip needs roughly the same passes as a shorter one.
- Multilingual transcription runs from a single adapter. It covers English, German, French, Spanish, Hindi, and Mandarin. Teams avoid loading a separate model per language.
- Non-autoregressive ASR research gains a reproducible baseline. The recipe grounds a frozen LLM with a small adapter. Researchers can extend it with more audio or a larger encoder.
How to get started
The model lives on the Hub. It ships the adapter, model.py, audio.py, and a runnable inference.py. DiffusionGemma support needs transformers from main.
pip install torch peft soundfile librosa huggingface_hub \
  "transformers @ git+https://github.com/huggingface/transformers.git"
Then transcribe in Python:
import sys, soundfile as sf
from huggingface_hub import snapshot_download
repo = snapshot_download("interfaze-ai/diffusion-gemma-asr-small") # adapter, ~170 MB
sys.path.insert(0, repo)
from inference import load, transcribe
# Loads frozen DiffusionGemma-26B + whisper-small + this adapter.
model, tok, fe = load(f"{repo}/diffusion_asr_small.pt", device="cuda")
wav, sr = sf.read("audio.wav")  # expects 16 kHz mono float32
print(transcribe(wav, model, tok, fe, max_steps=16))
A command-line path also works from inside the downloaded repo:
python inference.py audio.wav
The max_steps argument trades speed for accuracy. The team notes 8 is near-best and fastest. The default is 16. The base models load under their own licenses: DiffusionGemma under Gemma terms, whisper-small under MIT.
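For more than one file, a thin loop around the repo's functions is enough. The helper below is a hypothetical convenience wrapper, not part of the released code. It takes the reader and transcriber as arguments so it can be exercised without loading the 26B backbone, and it skips files that are not at the 16 kHz rate the snippet above expects.

```python
def transcribe_many(paths, read_fn, transcribe_fn, expected_sr=16_000, max_steps=8):
    """Transcribe several files, skipping any not at the expected sample rate.

    read_fn(path) -> (waveform, sample_rate)
    transcribe_fn(waveform, max_steps) -> str
    """
    results, skipped = {}, []
    for path in paths:
        wav, sr = read_fn(path)
        if sr != expected_sr:
            skipped.append(path)  # resample upstream rather than guessing here
            continue
        results[path] = transcribe_fn(wav, max_steps)
    return results, skipped


# Stand-ins for demonstration only; no audio is read and no model is loaded.
fake_audio = {"a.wav": ([0.0], 16_000), "b.wav": ([0.0], 44_100)}
results, skipped = transcribe_many(
    ["a.wav", "b.wav"],
    read_fn=lambda p: fake_audio[p],
    transcribe_fn=lambda wav, steps: f"<transcript at {steps} steps>",
)
print(results, skipped)
```

In real use, `read_fn` could be `soundfile.read` and `transcribe_fn` a small wrapper around the repo's `transcribe` that passes the loaded model, tokenizer, and feature extractor. The default of 8 steps reflects the team's note that 8 is near-best and fastest.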
The post Interfaze Ships diffusion-gemma-asr-small, an Open-Source Diffusion ASR Model Transcribing Six Languages via DiffusionGemma’s Parallel Denoising Decoder appeared first on MarkTechPost.
