Xiaomi Released MiMo-Audio, a 7B Speech Language Model Trained on 100M+ Hours with High-Fidelity Discrete Tokens
Xiaomi’s MiMo team has launched MiMo-Audio, a 7-billion-parameter audio-language model trained with a single next-token objective over interleaved text and discretized speech, scaling pretraining past 100 million hours of audio. What is actually new? Instead of relying on task-specific heads or lossy acoustic tokens, MiMo-Audio uses a bespoke RVQ (residual vector quantization) tokenizer that…
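For readers unfamiliar with residual vector quantization, the general idea can be sketched in a few lines. This is a generic toy illustration, not MiMo-Audio's tokenizer: the function names (`rvq_encode`, `rvq_decode`), the two hand-written 2-D codebooks, and the input vector are all assumptions made for the example. A real RVQ tokenizer learns its codebooks and quantizes encoder frames rather than raw 2-D points.

```python
from typing import List

Vector = List[float]
Codebook = List[Vector]

# Hypothetical toy codebooks: stage 1 is coarse, stage 2 refines the leftover error.
CODEBOOKS: List[Codebook] = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, -1.0], [1.0, -1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, -0.25]],
]


def nearest(codebook: Codebook, x: Vector) -> int:
    """Return the index of the codebook entry closest to x (squared L2)."""
    def sq_dist(entry: Vector) -> float:
        return sum((e - v) ** 2 for e, v in zip(entry, x))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def rvq_encode(x: Vector, codebooks: List[Codebook]) -> List[int]:
    """Quantize x stage by stage; each stage encodes the previous residual."""
    residual = list(x)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Subtract the chosen entry so the next stage sees only the remaining error.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes: List[int], codebooks: List[Codebook]) -> Vector:
    """Reconstruct the vector by summing the chosen entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


frame = [1.2, 0.9]                      # stand-in for one encoder frame
codes = rvq_encode(frame, CODEBOOKS)    # one discrete token per stage
approx = rvq_decode(codes, CODEBOOKS)   # approximate reconstruction of the frame
```

Each stage adds one discrete token per frame, and adding stages shrinks the reconstruction error, which is why RVQ is a common way to trade token count against fidelity.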
