|

NVIDIA Releases Nemotron 3.5 ASR: A 600M-Parameter Cache-Aware Streaming Model Transcribing 40 Language-Locales in Real Time

NVIDIA’s Nemotron Speech workforce has launched Nemotron 3.5 ASR. It is a 600M-parameter streaming Automatic Speech Recognition (ASR) mannequin. A single checkpoint transcribes 40 language-locales in actual time. Punctuation and capitalization are constructed in natively. The mannequin ships as open weights on Hugging Face. The license is OpenMDW-1.1. The structure is a Cache-Aware QuickConformer-RNNT.

What is Nemotron 3.5 ASR

Nemotron 3.5 ASR extends nvidia/nemotron-speech-streaming-en-0.6b to many languages. It provides prompt-based language-ID conditioning to the bottom mannequin. That lets one 600M-parameter checkpoint cowl 40 language-locales. No per-language mannequin or model-swapping is required.

The mannequin targets two workloads. The first is low-latency streaming for dwell audio. The second is high-throughput batch transcription. Output is production-ready textual content with correct casing and punctuation. No separate punctuation-restoration step is required.

(*40*)Image supply: https://huggingface.co/nvidia/nemotron-3.5-asr-streaming-0.6b

How Cache-Aware QuickConformer-RNNT Works

The mannequin has two principal items. The first is a Cache-Aware QuickConformer encoder with 24 layers. QuickConformer is an environment friendly evolution of the Conformer structure. It makes use of linearly scalable consideration. The second piece is an RNNT (Recurrent Neural Network Transducer) decoder. RNNT emits textual content body by body as audio streams in.

The “cache-aware” design is the effectivity lever. Buffered streaming re-processes overlapping audio home windows at each step. That repeats the identical work and provides delay. This mannequin caches encoder self-attention and convolution activations as an alternative. It reuses these cached states as new audio arrives. So every audio body is processed precisely as soon as, with no overlap. Compute and end-to-end latency each drop, with out an accuracy penalty.

The Latency Knob: att_context_size

One inference setting controls the latency-accuracy tradeoff. It is the eye context measurement, att_context_size. Smaller context emits textual content sooner however sees much less future audio. Larger context raises accuracy at greater latency.

The similar checkpoint covers the total vary. Settings map to chunk sizes of 80ms, 160ms, 320ms, 560ms, and 1.12s. For instance, [56,0] provides an 80ms ultra-low-latency mode. The [56,13] setting provides 1.12s for highest accuracy. Teams decide the working level at inference time, with no retraining.

Language Detection and Coverage

The 40 language-locales embrace English, Spanish, German, and French variants. They additionally cowl Arabic, Japanese, Korean, Mandarin, Hindi, and Thai. Several different European and Nordic languages are included too.

Language conditioning works two methods. Setting target_lang to a identified locale normally provides the very best accuracy. Setting target_lang=auto lets the mannequin detect the language itself. In auto mode, it emits a language tag after terminal punctuation. One deployment can then transcribe mixed-language visitors. No separate language-ID part is required.

Comparison

Product Company Access Native streaming Language protection Reported latency Pricing mannequin
Nemotron 3.5 ASR NVIDIA Open weights (OpenMDW-1.1), self-host; hosted on DeepInfra Yes — cache-aware QuickConformer-RNNT 40 language-locales 80ms–1.12s, configurable at inference Free to self-host; usage-based by way of host
Whisper large-v3 OpenAI Open weights (MIT), self-host; API No — offline/batch ~99 languages Not streaming-native Self-host free; API ~$0.006/min (batch)
Nova-3 Deepgram Closed API; on-premise/self-host (enterprise) Yes — streaming + batch Multilingual; +10 monolingual languages added Jan 2026 Low-latency streaming (reported sub-300ms) ~$0.0077/min (Nova-3 Monolingual, PAYG)
Universal-3 Pro Streaming AssemblyAI Closed API (EU endpoint out there) Yes 6 languages: English, Spanish, French, German, Italian, Portuguese Sub-300ms (official); first partial ~750ms Usage-based (PAYG)
Scribe v2 Realtime ElevenLabs Closed API Yes 90+ languages (99 per ElevenLabs) ~150ms (p50) ~$0.28/hour
Ursa / streaming Speechmatics API + on-premise + edge Yes — streaming + batch 50+ languages with computerized identification Ultra-low latency (positioned) Enterprise/utilization

Fine-Tuning Results

Because the weights are open, groups can fine-tune for a language, area, or accent. NVIDIA printed a labored instance on Greek and Bulgarian. It fine-tuned the bottom checkpoint with the identical Cache-Aware QuickConformer-RNNT recipe. Each clip carried a target_lang tag for language conditioning. Training knowledge got here from public corpora, together with Granary, Common Voice, and FLEURS.

Results had been measured as WER on held-out FLEURS, on the 80ms setting. Greek WER fell from 35 to 24, a 32% relative enchancment. Bulgarian fell from 22 to fifteen, a 31% relative enchancment. These are uncooked WER percentages on the lowest-latency streaming mode. NVIDIA notes that evaluating at deployment latency, on held-out knowledge, provides sincere numbers.

Strengths and Considerations

Strengths:

  • One 600M-parameter checkpoint covers 40 language-locales, slicing deployment sprawl.
  • Cache-aware streaming processes every body as soon as, reported at 17x buffered concurrency on an H100.
  • att_context_size tunes latency from 80ms to 1.12s at inference, with no retraining.
  • Punctuation, capitalization, and auto language tagging are constructed in.
  • Open weights enabled a 31–32% relative WER drop on Greek and Bulgarian after fine-tuning.

Considerations:

  • The mannequin handles English, however NVIDIA recommends its devoted English mannequin for English-only use.
  • The 80ms mode trades some accuracy for the bottom latency.
  • Japanese and Korean use CER, so cross-language error comparisons want care.
  • Throughput figures are measured on H100, so outcomes on different GPUs will differ.
  • The manufacturing NIM with gRPC streaming is introduced, however not but launched.

Key Takeaways

  • NVIDIA’s Nemotron 3.5 ASR is an open-weights (OpenMDW-1.1), 600M-parameter streaming mannequin transcribing 40 language-locales from one checkpoint.
  • Its Cache-Aware QuickConformer-RNNT design processes every audio body as soon as, reported at 17x the concurrent streams of buffered approaches on an H100.
  • Latency is configurable from 80ms to 1.12s at inference by way of att_context_size, with no retraining.
  • A quick fine-tune lower FLEURS WER 32% on Greek (35→24) and 31% on Bulgarian (22→15), on the 80ms setting.
  • It is self-hostable and streaming-native, in contrast to closed APIs (Deepgram, AssemblyAI, ElevenLabs) or offline Whisper.

Marktechpost’s Visual Explainer


NEMOTRON 3.5 ASR
1 / 10

NVIDIA · STREAMING SPEECH AI · OPEN WEIGHTS

Nemotron 3.5 ASR

A 600M-parameter cache-aware streaming mannequin that transcribes 40 language-locales in actual time, from a single checkpoint.

600M parameters
40 language-locales
80ms–1.12s latency
OpenMDW-1.1

01 — WHAT IT IS

One mannequin, 40 language-locales

  • Extends nvidia/nemotron-speech-streaming-en-0.6b with prompt-based language-ID conditioning.
  • A single 600M-parameter checkpoint covers 40 language-locales. No model-swapping.
  • Punctuation and capitalization are constructed in. No separate post-processing step.
  • Targets two workloads: low-latency streaming and high-throughput batch.
  • NVIDIA nonetheless recommends its English-only mannequin for English-only use.

02 — ARCHITECTURE

Cache-Aware QuickConformer-RNNT

  • A 24-layer QuickConformer encoder paired with an RNNT decoder.
  • Buffered streaming re-processes overlapping audio home windows at each step.
  • This mannequin caches encoder self-attention and convolution states, then reuses them.
  • Each audio body is processed precisely as soon as, with no overlap.
  • Compute and end-to-end latency drop, with no accuracy penalty.

03 — THE LATENCY KNOB

One setting tunes latency vs. accuracy

att_context_size Chunk (latency) Use case
[56,0] 80ms (Ultra-Low) Ultra low latency voice brokers
[56,1] 160ms (Low) Interactive voice brokers
[56,3] 320ms (Balanced) Conversational AI, dwell caption
[56,6] 560ms (Medium) Higher accuracy, affordable latency
[56,13] 1.12s (High) Highest accuracy

Same checkpoint, chosen at inference time. No retraining required.

04 — LANGUAGES & DETECTION

Coverage and computerized language ID

  • 40 language-locales, together with English, Spanish, German, and French variants.
  • Also covers Arabic, Japanese, Korean, Mandarin, Hindi, and Thai.
  • Set target_lang to a identified locale for the very best accuracy.
  • Set target_lang=auto to let the mannequin detect the language.
  • In auto mode, it emits a language tag after terminal punctuation.
  • One deployment handles mixed-language visitors, with no separate language-ID part.

05 — THROUGHPUT

Half the scale, extra concurrent streams

  • NVIDIA compares it in opposition to Parakeet RNNT 1.1B multilingual, which makes use of buffered streaming.
  • Nemotron 3.5 ASR is roughly half the scale: 0.6B versus 1.1B.
  • The workforce reviews 17x the concurrent streams of buffered approaches, on the identical H100.
  • Avoiding redundant recomputation lowers the associated fee per stream in manufacturing.

The 17x determine is from the discharge announcement; the mannequin card states the qualitative declare straight.

06 — FINE-TUNING RESULTS

A quick fine-tune lifts weaker languages

Language Base WER Fine-tuned Relative
Greek 35 24 32%
Bulgarian 22 15 31%

Raw WER (%) on held-out FLEURS on the 80ms setting. Data: Granary, Common Voice, FLEURS.

07 — AVAILABILITY & ACCESS

Open weights, plus a hosted path

  • Weights on Hugging Face below the OpenMDW-1.1 license.
  • Runtime is NeMo 26.06 or newer. Input have to be mono-channel.
  • Hosted on DeepInfra, which provides phrase boosting for area vocabulary.
  • NVIDIA says a NIM launch is deliberate for later in the month, with gRPC streaming.
  • Stated GPU help: Ampere, Hopper, Blackwell, Lovelace, Turing, Volta, and Jetson.

08 — HOW IT COMPARES

Where it sits in the panorama

Product Access Streaming Languages
Nemotron 3.5 ASR Open weights Native 40 locales
Whisper large-v3 Open weights No (batch) ~99
Deepgram Nova-3 API / on-prem Native Multilingual
AssemblyAI U-3 Pro API Native 6
ElevenLabs Scribe v2 API Native 90+
Google Chirp / Azure API Native 100+ / 140+

Latency and WER aren’t straight comparable throughout distributors; this compares construction, not a rating.

09 — KEY TAKEAWAYS

The quick model

  • An open-weights 600M streaming mannequin transcribing 40 language-locales from one checkpoint.
  • Cache-aware design processes every body as soon as; reported 17x buffered concurrency on an H100.
  • Latency configurable from 80ms to 1.12s at inference, with no retraining.
  • A quick fine-tune lower FLEURS WER 32% (Greek) and 31% (Bulgarian).
  • Self-hostable and streaming-native, in contrast to closed APIs or offline Whisper.


Curated for AI engineers by Marktechpost — practitioner-first protection of AI & ML.


Check out the Model weightsAlso, be at liberty to comply with us on Twitter and don’t neglect to affix our 150k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.

Need to companion with us for selling your GitHub Repo OR Hugging Face Page OR Product Release OR Webinar and so on.? Connect with us

The publish NVIDIA Releases Nemotron 3.5 ASR: A 600M-Parameter Cache-Aware Streaming Model Transcribing 40 Language-Locales in Real Time appeared first on MarkTechPost.

Similar Posts