
IBM Releases Two Granite Speech 4.1 2B Models: Autoregressive ASR with Translation and Non-Autoregressive Editing for Fast Inference

IBM launched two new open speech recognition models, Granite Speech 4.1 2B and Granite Speech 4.1 2B-NAR, and they make a compelling case for what a ~2B-parameter speech model can do. Both are available on Hugging Face under the Apache 2.0 license.

The pair targets a specific problem that enterprise AI teams know well: most production-grade automatic speech recognition (ASR) systems either demand massive compute or sacrifice accuracy to stay within budget. IBM's bet is that careful architecture choices can let you have it both ways.

What These Models Actually Do

Granite Speech 4.1 2B is a compact and efficient speech-language model designed for multilingual automatic speech recognition (ASR) and bidirectional automatic speech translation (AST) covering English, French, German, Spanish, Portuguese, and Japanese. Its non-autoregressive counterpart, Granite Speech 4.1 2B-NAR, focuses solely on ASR, specifically targeting latency-sensitive deployments, and supports English, French, German, Spanish, and Portuguese, but not Japanese. That is a meaningful distinction: teams that need Japanese transcription or any speech translation capability should reach for the standard autoregressive model.

IBM also quietly launched a third variant alongside these two. Granite Speech 4.1 2B-Plus adds speaker-attributed ASR and word-level timestamps for applications where knowing who said what, and exactly when, is a requirement.

Word Error Rate (WER) is the primary metric for measuring transcription quality. Lower is better. A WER of 5% means roughly 5 out of every 100 words are wrong. On the Open ASR Leaderboard (as of April 2026), Granite Speech 4.1 2B scores a mean WER of 5.33. Drilling into benchmark detail: on LibriSpeech clean, the model achieves a WER of 1.33, and 2.5 on LibriSpeech other.
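As an illustration, WER is the word-level edit distance between a reference and a hypothesis, divided by the number of reference words. A minimal sketch (not IBM's evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

# One deleted word out of six reference words -> WER of 1/6 (about 16.7%).
print(wer("the cat sat on the mat", "the cat sat on mat"))
```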

The Architecture, Explained

Both models share the same three-component design at a high level (a speech encoder, a modality adapter, and a language model), though the decoding mechanism diverges considerably.

The first component is the speech encoder. The architecture uses 16 Conformer blocks trained with Connectionist Temporal Classification (CTC) using two classification heads, one for graphemic (character-level) outputs and one for BPE units, with frame importance sampling to focus on informative parts of the audio. A Conformer is a neural network layer that combines convolutional layers (good at capturing local acoustic patterns) with attention mechanisms (good at capturing long-range dependencies). CTC is a training technique that lets the model learn from audio-text pairs without needing exact frame-level alignment.
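CTC's alignment-free property comes from its decoding rule: merge consecutive repeated labels, then drop blank symbols, so many different frame-level alignments map to the same text. A toy greedy-decoding sketch (illustrative only; the blank symbol and inputs are invented, not Granite's actual encoder):

```python
BLANK = "_"  # stand-in for the CTC blank symbol

def ctc_collapse(frame_labels):
    """Greedy CTC decode: merge consecutive repeats, then remove blanks."""
    out = []
    prev = None
    for label in frame_labels:
        if label != prev and label != BLANK:
            out.append(label)
        prev = label
    return "".join(out)

# Different frame alignments collapse to the same transcript,
# so training never needs an exact frame-to-character alignment.
print(ctc_collapse(list("cc_aa_t")))   # → cat
print(ctc_collapse(list("c_a_tt__")))  # → cat
```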

The second component is a speech-text modality adapter. A 2-layer window query transformer (Q-Former) operates on blocks of 15 1024-dimensional acoustic embeddings coming from the last Conformer block, downsampling by a factor of 5 using 3 trainable queries per block and per layer, for a total temporal downsampling factor of 10, resulting in a 10 Hz acoustic embedding rate for the LLM. This adapter bridges the gap between continuous acoustic features and discrete text tokens, compressing the audio representation so the language model can process it efficiently. In the NAR model, the Q-Former has 160M parameters and downsamples the concatenated hidden representations from 4 encoder layers (layers 4, 8, 12, and 16).
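The numbers in that description compose as follows. This is a back-of-the-envelope sketch: the 100 Hz input frame rate and the additional 2x reduction before the adapter are inferred from the stated 10x total and 10 Hz output, not taken from the model card.

```python
window = 15                          # acoustic embeddings per Q-Former block
queries = 3                          # trainable queries per block and per layer
adapter_factor = window // queries   # 15 embeddings -> 3 queries = 5x

encoder_factor = 2                   # assumed extra subsampling upstream of the adapter
total_factor = adapter_factor * encoder_factor
assert total_factor == 10            # matches the stated 10x total downsampling

input_rate_hz = 100                  # assumed raw acoustic frame rate
llm_rate_hz = input_rate_hz / total_factor
print(llm_rate_hz)                   # → 10.0 Hz embedding rate seen by the LLM
```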

The third component is the language model. Granite Speech 4.1 2B uses an intermediate checkpoint of granite-4.0-1b-base with 128k context length, fine-tuned on all training corpora. In the NAR variant, this becomes a 1B-parameter bidirectional LLM editor (granite-4.0-1b-base with its causal attention mask removed to enable bidirectional context), adapted with LoRA at rank 128 applied to both attention and MLP layers.

The Autoregressive vs. Non-Autoregressive Tradeoff

This is where the two models diverge most sharply, and it has direct consequences for production deployment.

In the standard Granite Speech 4.1 2B, text is generated autoregressively: one token at a time, each depending on every token before it. This produces accurate, stable transcripts with full support for AST, keyword-biased recognition, and punctuation, but it is inherently sequential and slower at scale.

Granite Speech 4.1 2B-NAR takes a fundamentally different approach. Rather than decoding tokens one by one, it edits a CTC hypothesis in a single forward pass using a bidirectional LLM, achieving competitive accuracy with faster inference than autoregressive alternatives. This is the NLE (Non-autoregressive LLM-based Editing) architecture. Concretely: the CTC encoder produces a rough initial transcript, that hypothesis is interleaved with insertion slots, and a bidirectional LLM then predicts edits (copy, insert, delete, or replace) at all positions simultaneously in a single pass.
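To make the editing step concrete, here is a toy sketch of applying per-position edit decisions to a CTC hypothesis. The operation names mirror those in the article; everything else (the data layout, the example edits) is invented for illustration and is not IBM's implementation:

```python
def apply_edits(hypothesis, edits):
    """Apply one (op, arg) decision per hypothesis token.

    'copy' keeps the token, 'delete' drops it, 'replace' swaps it for arg,
    'insert' emits arg before keeping the token. Each decision depends only
    on its own position, which is why a bidirectional model can predict all
    of them simultaneously in a single forward pass.
    """
    out = []
    for token, (op, arg) in zip(hypothesis, edits):
        if op == "copy":
            out.append(token)
        elif op == "replace":
            out.append(arg)
        elif op == "insert":
            out.extend([arg, token])
        # 'delete': emit nothing for this position
    return out

# Rough CTC hypothesis with a duplicated word and a dropped phrase:
hyp = ["the", "cat", "sat", "sat", "mat"]
edits = [("copy", None), ("copy", None), ("copy", None),
         ("delete", None), ("insert", "on the")]
print(" ".join(apply_edits(hyp, edits)))  # → the cat sat on the mat
```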

The NAR model measured an RTFx of roughly 1820 on a single H100 GPU using batched inference at batch size 128. RTFx (real-time factor multiplier) measures how many times faster than real time a model can process audio; an RTFx of 1820 means a one-hour audio file can be transcribed in under two seconds on that hardware. One practical constraint engineers should note: the NAR model requires flash_attention_2 for inference, since this backend supports sequence packing and respects the is_causal=False flag.
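The RTFx claim translates directly into wall-clock time:

```python
def transcription_seconds(audio_seconds: float, rtfx: float) -> float:
    """Wall-clock processing time implied by a real-time factor multiplier."""
    return audio_seconds / rtfx

one_hour = 3600.0
# 3600 s of audio at RTFx 1820 -> about 1.98 s, i.e. under two seconds.
print(transcription_seconds(one_hour, 1820))
```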

Training Data and Infrastructure

The two models were trained on different datasets. The standard model was trained on 174,000 hours of audio from public corpora for ASR and AST, as well as synthetic datasets tailored to support Japanese ASR, keyword-biased ASR, and speech translation. The NAR model was trained on roughly 130,000 hours of speech across 5 languages using publicly available datasets including CommonVoice 15, MLS, LibriSpeech, LibriHeavy, AMI, Granary VoxPopuli, Granary YODAS, Earnings-22, Fisher, CallHome, and SwitchBoard.

The infrastructure gap between the two is equally telling. The standard model's training was completed in 30 days (26 days for the encoder and 4 days for the projector) on 8 H100 GPUs. The NAR model trained in just 3 days on 16 H100 GPUs (2 nodes) for 5 epochs: a much lighter training run, which reflects the architectural simplicity of editing over full autoregressive generation.
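In raw GPU-days, the gap is stark:

```python
standard_gpu_days = 30 * 8   # 30 days on 8 H100s
nar_gpu_days = 3 * 16        # 3 days on 16 H100s (2 nodes)
ratio = standard_gpu_days / nar_gpu_days
# 240 vs 48 GPU-days: the NAR run used 5x less compute.
print(standard_gpu_days, nar_gpu_days, ratio)
```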

Key Takeaways

Here are five quick key takeaways:

  • IBM released two open ASR models, Granite Speech 4.1 2B (autoregressive) and Granite Speech 4.1 2B-NAR (non-autoregressive), both ~2B parameters and Apache 2.0 licensed.
  • The standard model achieves a mean WER of 5.33 on the Open ASR Leaderboard and supports 6 languages for ASR (including Japanese), bidirectional speech translation, keyword biasing, and punctuation/truecasing, making it competitive with models several times its size.
  • The NAR model trades capabilities for speed: it drops Japanese, AST, and keyword biasing, but delivers an RTFx of ~1820 on a single H100 GPU by editing a CTC hypothesis in a single forward pass rather than generating tokens one at a time.
  • The architecture has three core components: a 16-layer Conformer encoder trained with dual-head CTC, a 2-layer window Q-Former projector that downsamples audio to a 10 Hz embedding rate, and a fine-tuned granite-4.0-1b-base language model.
  • A third variant, Granite Speech 4.1 2B-Plus, also exists, extending the standard model with speaker-attributed ASR and word-level timestamps for applications where speaker identity and precise timing are required.

Check out the Model-Granite Speech 4.1 2B and Model-Granite Speech 4.1 2B (NAR).

The post IBM Releases Two Granite Speech 4.1 2B Models: Autoregressive ASR with Translation and Non-Autoregressive Editing for Fast Inference appeared first on MarkTechPost.