
Xiaomi Released MiMo-Audio, a 7B Speech Language Model Trained on 100M+ Hours with High-Fidelity Discrete Tokens

Xiaomi’s MiMo team released MiMo-Audio, a 7-billion-parameter audio-language model that applies a single next-token objective over interleaved text and discretized speech, scaling pretraining beyond 100 million hours of audio.

What’s actually new?

Instead of relying on task-specific heads or lossy acoustic tokens, MiMo-Audio uses a bespoke RVQ (residual vector quantization) tokenizer that targets both semantic fidelity and high-quality reconstruction. The tokenizer runs at 25 Hz and outputs 8 RVQ layers (≈200 tokens/s), giving the LM access to “lossless” speech features it can model autoregressively alongside text.
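For intuition, here is a minimal sketch of residual vector quantization, the scheme the tokenizer is built on. The 8-layer depth and 25 Hz frame rate follow the article; the codebook size, latent dimension, and random codebooks are purely illustrative assumptions, not Xiaomi’s released tokenizer.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_LAYERS = 8          # 8 RVQ codebooks per 25 Hz frame (per the article)
CODEBOOK_SIZE = 1024    # assumed codebook size; not stated in the article
DIM = 256               # assumed latent dimension of one speech frame

# Random stand-in codebooks; a real tokenizer learns these.
codebooks = rng.normal(size=(NUM_LAYERS, CODEBOOK_SIZE, DIM))

def rvq_encode(frame: np.ndarray) -> list[int]:
    """Quantize one latent frame into 8 token ids, one per layer.
    Each layer quantizes the residual left by the previous layers, which is
    what lets deep RVQ stacks approach 'lossless' reconstruction."""
    residual = frame.copy()
    ids = []
    for layer in codebooks:
        idx = int(np.argmin(np.linalg.norm(layer - residual, axis=1)))
        ids.append(idx)
        residual = residual - layer[idx]   # pass the remainder down the stack
    return ids

def rvq_decode(ids: list[int]) -> np.ndarray:
    """Sum the chosen codewords across layers to reconstruct the frame."""
    return sum(codebooks[l][i] for l, i in enumerate(ids))

frame = rng.normal(size=DIM)
ids = rvq_encode(frame)    # 8 ids per frame -> 25 Hz * 8 = 200 tokens/s
recon = rvq_decode(ids)    # approximation improves as layers are added
```

The arithmetic in the article falls out directly: 25 frames per second times 8 codebooks yields the quoted ≈200 tokens per second of speech.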

Architecture: patch encoder → 7B LLM → patch decoder

To handle the audio/text rate mismatch, the system packs 4 timesteps per patch for LM consumption (downsampling 25 Hz → 6.25 Hz), then reconstructs the full-rate RVQ streams with a causal patch decoder. A delayed multi-layer RVQ generation scheme staggers predictions per codebook to stabilize synthesis and respect inter-layer dependencies. All three components (patch encoder, MiMo-7B backbone, and patch decoder) are trained under a single next-token objective.
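A short sketch of the two rate tricks described here, under illustrative assumptions: the array layouts and the delay of one step per codebook are ours, not necessarily the paper’s exact scheme.

```python
import numpy as np

PATCH = 4           # timesteps per patch: 25 Hz / 4 = 6.25 Hz for the LM
LAYERS = 8          # RVQ codebooks per timestep
PAD = -1            # placeholder id for positions shifted past the edge

def patchify(tokens: np.ndarray) -> np.ndarray:
    """(T, LAYERS) token grid -> (T // PATCH, PATCH * LAYERS) patches."""
    T = tokens.shape[0] - tokens.shape[0] % PATCH   # drop a ragged tail
    return tokens[:T].reshape(T // PATCH, PATCH * LAYERS)

def apply_delay(tokens: np.ndarray) -> np.ndarray:
    """Shift codebook k right by k steps, so layer k at time t is predicted
    only after layers 0..k-1 at earlier positions are already fixed."""
    T = tokens.shape[0]
    out = np.full((T + LAYERS - 1, LAYERS), PAD, dtype=tokens.dtype)
    for k in range(LAYERS):
        out[k:k + T, k] = tokens[:, k]
    return out

tokens = np.arange(16 * LAYERS).reshape(16, LAYERS)   # 16 frames of dummy ids
patches = patchify(tokens)      # shape (4, 32): 4 LM steps for 16 frames
delayed = apply_delay(tokens)   # shape (23, 8): staggered for decoding
```

Patchification is what keeps a minute of speech to a few hundred LM positions instead of thousands, and the staggered layout is what lets later codebooks condition on earlier ones during generation.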

https://xiaomimimo.github.io/MiMo-Audio-Demo/

Scale is the algorithm

Training proceeds in two major stages: (1) an “understanding” stage that optimizes text-token loss over interleaved speech-text corpora, and (2) a joint “understanding + generation” stage that turns on audio losses for speech continuation, S2T/T2S tasks, and instruction-style data. The report emphasizes a compute/data threshold at which few-shot behavior appears to “switch on,” echoing the emergence curves seen in large text-only LMs.
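A minimal sketch of how such staged loss masking can be expressed, assuming a 0/1 token-type convention (0 = text, 1 = audio) that is ours for illustration, not the paper’s:

```python
import torch
import torch.nn.functional as F

def masked_next_token_loss(logits: torch.Tensor, targets: torch.Tensor,
                           token_type: torch.Tensor,
                           train_audio: bool) -> torch.Tensor:
    """logits: (B, T, V); targets, token_type: (B, T), 0 = text, 1 = audio.
    Stage 1: train_audio=False (text-token loss only over interleaved data).
    Stage 2: train_audio=True  (audio losses switched on as well)."""
    per_tok = F.cross_entropy(logits.transpose(1, 2), targets,
                              reduction="none")            # (B, T)
    keep = torch.ones_like(targets, dtype=torch.bool) if train_audio \
        else (token_type == 0)                             # stage 1: text only
    return (per_tok * keep).sum() / keep.sum().clamp(min=1)

B, T, V = 2, 16, 32000
logits = torch.randn(B, T, V)
targets = torch.randint(V, (B, T))
token_type = torch.randint(2, (B, T))   # which positions hold audio tokens
stage1_loss = masked_next_token_loss(logits, targets, token_type, train_audio=False)
stage2_loss = masked_next_token_loss(logits, targets, token_type, train_audio=True)
```

The appeal of this setup is that moving from stage 1 to stage 2 changes only which positions contribute gradient, not the architecture or the objective.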

Benchmarks: speech intelligence and general audio

MiMo-Audio is evaluated on speech-reasoning suites (e.g., SpeechMMLU) and broad audio-understanding benchmarks (e.g., MMAU), reporting strong scores across speech, sound, and music, and a reduced “modality gap” between text-only and speech-in/speech-out settings. Xiaomi also releases MiMo-Audio-Eval, a public toolkit for reproducing these results. Listen-and-respond demos (speech continuation, voice/emotion conversion, denoising, and speech translation) are available online.


Why this matters

The approach is deliberately simple: no multi-head task tower, no bespoke ASR/TTS objectives at pretraining time, just GPT-style next-token prediction over lossless audio tokens plus text. The key engineering ideas are (i) a tokenizer the LM can actually use without throwing away prosody and speaker identity; (ii) patchification to keep sequence lengths manageable; and (iii) delayed RVQ decoding to preserve quality at generation time. For teams building spoken agents, these design choices translate into few-shot speech-to-speech editing and robust speech continuation with minimal task-specific finetuning.
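To make the in-context idea concrete, here is a hypothetical layout for a few-shot speech-to-speech prompt. The separator/arrow tokens and the demo-pair format are assumptions for illustration; the released model defines its own prompt conventions.

```python
def build_fewshot_s2s_prompt(demos: list[tuple[list[int], list[int]]],
                             query: list[int],
                             sep: int, arrow: int) -> list[int]:
    """Lay out k (source speech -> edited speech) token demos, then a query.
    A GPT-style model is expected to continue the sequence with the same
    edit applied to the query utterance; no task-specific finetuning."""
    prompt: list[int] = []
    for src, tgt in demos:
        prompt += src + [arrow] + tgt + [sep]
    prompt += query + [arrow]   # generation picks up from here
    return prompt

# e.g. two demos of an emotion-conversion edit, then a new utterance:
# prompt = build_fewshot_s2s_prompt([(a1, b1), (a2, b2)], query=a3,
#                                   sep=SEP_ID, arrow=ARROW_ID)
```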

6 Technical Takeaways:

  1. High-Fidelity Tokenization
    MiMo-Audio uses a custom RVQ tokenizer operating at 25 Hz with 8 active codebooks, ensuring speech tokens preserve prosody, timbre, and speaker identity while keeping them LM-friendly.
  2. Patchified Sequence Modeling
    The model reduces sequence length by grouping 4 timesteps into one patch (25 Hz → 6.25 Hz), letting the 7B LLM handle long speech efficiently without discarding detail.
  3. Unified Next-Token Objective
    Rather than separate heads for ASR, TTS, or dialogue, MiMo-Audio trains under a single next-token prediction loss across interleaved text and audio, simplifying the architecture while supporting multi-task generalization.
  4. Emergent Few-Shot Abilities
    Few-shot behaviors such as speech continuation, voice conversion, emotion transfer, and speech translation emerge once training surpasses a large-scale data threshold (~100M hours, trillions of tokens).
  5. Benchmark Leadership
    MiMo-Audio sets state-of-the-art scores on SpeechMMLU (S2S 69.1, T2S 71.5) and MMAU (66.0 overall), while shrinking the text-to-speech modality gap to just 3.4 points.
  6. Open Ecosystem Release
    Xiaomi provides the tokenizer, 7B checkpoints (base and instruct), the MiMo-Audio-Eval toolkit, and public demos, enabling researchers and developers to test and extend speech-to-speech intelligence in open-source pipelines.

Summary

MiMo-Audio demonstrates that high-fidelity, RVQ-based “lossless” tokenization combined with patchified next-token pretraining at scale is sufficient to unlock few-shot speech intelligence without task-specific heads. The 7B stack (tokenizer → patch encoder → LLM → patch decoder) bridges the audio/text rate gap (25 → 6.25 Hz) and preserves prosody and speaker identity via delayed multi-layer RVQ decoding. Empirically, the model narrows the text↔speech modality gap, generalizes across speech/sound/music benchmarks, and supports in-context S2S editing and continuation.


Check out the Paper, Technical details, and GitHub Page. Feel free to check out our GitHub Page for Tutorials, Codes, and Notebooks. Also, follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and subscribe to our Newsletter.

