Miso Labs Releases MisoTTS: An 8B Emotive Text-to-Speech Model with Open Weights
Miso Labs has launched MisoTTS, an open-weights, 8-billion-parameter text-to-speech model. It generates expressive speech from both text and audio context. The model uses residual vector quantization (RVQ) to widen its sonic range while holding the parameter count fixed, rather than scaling a single flat vocabulary.
What is MisoTTS
MisoTTS is an 8B-parameter text-to-dialogue RVQ Transformer inspired by the Sesame CSM architecture. It pairs a Llama 3.2-style backbone with a smaller audio decoder and generates Mimi audio codes from text and optional audio context. Because the model conditions on both text and prior audio, it can respond to the speaker’s tone.
The text vocabulary is 128,256 tokens, and there are 32 audio codebooks. Mimi is the audio tokenizer, and the maximum sequence length is 2,048. Default inference runs in torch.bfloat16.
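For reference, the figures above can be gathered into a small configuration object. This is only an illustrative sketch: the class and field names are hypothetical and are not taken from the MisoTTS repository, and the 2,048-entry codebook size comes from the RVQ description later in the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TTSConfigSketch:
    """Hypothetical container for the figures quoted in the article."""
    text_vocab_size: int = 128_256
    num_audio_codebooks: int = 32
    audio_codebook_size: int = 2048
    max_seq_len: int = 2048
    dtype: str = "bfloat16"  # the article says inference defaults to torch.bfloat16

    def is_valid_frame(self, frame):
        """True if `frame` holds exactly one in-range index per codebook."""
        return (
            len(frame) == self.num_audio_codebooks
            and all(0 <= i < self.audio_codebook_size for i in frame)
        )


cfg = TTSConfigSketch()
print(cfg.is_valid_frame([0] * 32))      # full frame, every index in range
print(cfg.is_valid_frame([2048] * 32))   # 2048 is one past the last valid index
```

A check like `is_valid_frame` is a cheap guard to run before handing codes to an audio decoder.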
Miso Labs claims 110 ms latency for MisoTTS. For comparison, it lists ElevenLabs at 700 ms and Sesame at 300 ms.
The Vocabulary Size Problem
Standard transformers generate from a fixed vocabulary of discrete tokens. That works when a small vocabulary covers the target domain. Human speech doesn’t fit that assumption: it varies across pitch, rhythm, emphasis, emotion, and accent.
Expanding the audio vocabulary is the obvious fix. But larger vocabularies need more parameters in a standard transformer, because each token must be represented and predicted by the model. Miso Labs calls this the vocabulary size problem.
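A rough way to see the cost is to count the parameters a flat vocabulary spends on its embedding table and output projection. The sketch below assumes a hidden width of 4096 and an untied output head. Both are illustrative assumptions, not figures from Miso Labs, and the function name is hypothetical.

```python
def flat_vocab_params(vocab_size, d_model, tied=False):
    """Parameters spent on an input embedding table plus an output projection."""
    embedding = vocab_size * d_model
    output_head = 0 if tied else vocab_size * d_model
    return embedding + output_head


D_MODEL = 4096  # assumed hidden width, for illustration only

for vocab in (2_048, 128_256, 1_000_000, 100_000_000):
    params = flat_vocab_params(vocab, D_MODEL)
    print(f"{vocab:>12,} tokens -> {params / 1e9:8.2f}B vocabulary parameters")
```

The cost grows linearly with vocabulary size, so a flat vocabulary large enough to cover expressive speech would quickly dominate the parameter budget.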
The second issue is conditioning. Most TTS models condition only on text and ignore the interlocutor’s tone. Miso Labs argues this contributes to the “uncanny valley” effect.
Residual Vector Quantization: The Core Idea
MisoTTS addresses both problems with residual vector quantization (RVQ). Miso Labs traces RVQ to image-generation research and to Sesame’s CSM for audio. Instead of one token index, the model emits a vector of indices.
Each audio token is a vector of 32 codebook indices, each drawn from a 2048-way codebook. The model keeps a separate codebook for each position in the vector. To recover the sound, it sums the looked-up vectors. Each codebook adds another refinement to the signal.
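The lookup-and-sum step can be sketched in a few lines of plain Python. This is a toy illustration with random codebooks (depth 3, 4 entries each, 2-dimensional vectors) rather than the 32 × 2048 setup described above. The function names are hypothetical, the greedy encoder is the textbook RVQ procedure rather than anything confirmed about MisoTTS, and in a real codec such as Mimi the summed vector would presumably still pass through a learned decoder to produce a waveform.

```python
import random


def make_codebooks(depth, size, dim, seed=0):
    """Random toy codebooks; deeper levels get smaller vectors (finer detail)."""
    rng = random.Random(seed)
    books = []
    for level in range(depth):
        scale = 0.5 ** level
        books.append(
            [[rng.uniform(-scale, scale) for _ in range(dim)] for _ in range(size)]
        )
    return books


def rvq_decode(indices, codebooks):
    """Sum one looked-up vector per codebook level."""
    out = [0.0] * len(codebooks[0][0])
    for level, index in enumerate(indices):
        out = [o + v for o, v in zip(out, codebooks[level][index])]
    return out


def rvq_encode(target, codebooks):
    """Greedy RVQ: at each level, quantize what the earlier levels left over."""
    residual = list(target)
    indices = []
    for book in codebooks:
        best = min(
            range(len(book)),
            key=lambda i: sum((r - c) ** 2 for r, c in zip(residual, book[i])),
        )
        indices.append(best)
        residual = [r - c for r, c in zip(residual, book[best])]
    return indices


books = make_codebooks(depth=3, size=4, dim=2)
codes = rvq_encode([0.3, -0.2], books)
print(codes, rvq_decode(codes, books))
```

Each level only has to describe the residual left by the levels before it, which is why later codebooks act as refinements.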
This is what makes the scaling work. The addressable vocabulary equals the codebook size raised to the depth, and increasing the depth adds no parameters to the model. So MisoTTS reaches about 2048³², or roughly 10¹⁰⁵, addressable tokens. Miso Labs notes that naive scaling would require a far larger network.
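The arithmetic is easy to check. The short script below uses only the two figures from the article (2048-way codebooks, depth 32) and compares the number of addressable combinations with the number of codebook rows that actually have to be stored.

```python
import math

CODEBOOK_SIZE = 2048
DEPTH = 32

addressable = CODEBOOK_SIZE ** DEPTH      # exact integer, equal to 2**352
stored_entries = CODEBOOK_SIZE * DEPTH    # rows across all codebooks

print(f"log10(addressable)   = {DEPTH * math.log10(CODEBOOK_SIZE):.2f}")
print(f"bits per audio token = {DEPTH * math.log2(CODEBOOK_SIZE):.0f}")
print(f"decimal digits       = {len(str(addressable))}")
print(f"codebook rows stored = {stored_entries:,}")
```

The base-10 exponent comes out at about 105.96, which matches the order of magnitude quoted above, while the stored codebook rows number only 65,536.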

The Two-Transformer Architecture
The model splits into a backbone and a decoder. The backbone is a 7.7B-parameter transformer that runs autoregressively over time. At each step it predicts the first codebook index and a final hidden state.
A 300M-parameter decoder then runs autoregressively over depth. It predicts the remaining codebook indices, one position at a time. Each prediction conditions on the indices already chosen within the frame. The same 300M parameters are reused for every position.
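The control flow of the two loops can be sketched as follows. The two `toy_*` functions are hypothetical stand-ins that return pseudo-random indices. They only mimic the interfaces described above (backbone: history in, first index and hidden state out; decoder: hidden state and chosen indices in, next index out) and say nothing about the real networks.

```python
import random

NUM_CODEBOOKS = 32
CODEBOOK_SIZE = 2048


def toy_backbone(history):
    """Stand-in for the backbone: returns (first codebook index, hidden state)."""
    rng = random.Random(len(history))
    hidden = [rng.random() for _ in range(4)]
    return rng.randrange(CODEBOOK_SIZE), hidden


def toy_decoder(hidden, chosen):
    """Stand-in for the depth decoder: next index given the indices chosen so far."""
    rng = random.Random(sum(chosen) + len(chosen))
    return rng.randrange(CODEBOOK_SIZE)


def generate_frame(history):
    """One time step: the backbone picks index 0, the decoder fills in indices 1..31."""
    first, hidden = toy_backbone(history)
    frame = [first]
    for _ in range(1, NUM_CODEBOOKS):
        # The same decoder function is reused for every depth position.
        frame.append(toy_decoder(hidden, frame))
    return frame


def generate(num_frames, history=None):
    """Outer loop over time; each new frame is appended to the history."""
    history = list(history or [])
    for _ in range(num_frames):
        history.append(generate_frame(history))
    return history


frames = generate(3)
print(len(frames), len(frames[0]))
```

Under this split, the large backbone runs once per frame and the small decoder runs 31 times per frame.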
Embeddings follow the same logic. Text tokens use a single lookup. An audio token’s embedding is the sum of per-position codebook lookups. Interleaving text and audio lets the backbone use conversation history, which is how it carries context across turns.
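A minimal sketch of that embedding scheme, using small random tables in place of the real ones, might look like this. The table sizes, the 4-dimensional width, and all function names are assumptions made for illustration.

```python
import random

DIM = 4  # toy embedding width
_rng = random.Random(0)


def make_table(rows):
    return [[_rng.uniform(-1.0, 1.0) for _ in range(DIM)] for _ in range(rows)]


text_table = make_table(16)                       # toy; the article cites 128,256 text tokens
audio_tables = [make_table(8) for _ in range(3)]  # toy; the article cites 32 codebooks of 2048


def embed_text(token_id):
    """Text token: a single table lookup."""
    return list(text_table[token_id])


def embed_audio(frame):
    """Audio token: sum of one lookup per codebook position."""
    total = [0.0] * DIM
    for position, index in enumerate(frame):
        row = audio_tables[position][index]
        total = [t + r for t, r in zip(total, row)]
    return total


def embed_sequence(items):
    """Embed an interleaved list of ("text", id) and ("audio", frame) items."""
    return [
        embed_text(value) if kind == "text" else embed_audio(value)
        for kind, value in items
    ]


conversation = [
    ("text", 3), ("text", 7), ("audio", [1, 0, 5]),
    ("text", 2), ("audio", [4, 4, 2]),
]
embedded = embed_sequence(conversation)
print(len(embedded), len(embedded[0]))
```

Because both kinds of token end up as vectors of the same width, text and audio from earlier turns can sit in one sequence for the backbone to attend over.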
Strengths and Challenges
Strengths:
- Open weights on day one, under a modified MIT license.
- RVQ scales the sonic range without scaling parameter count.
- Conditions on audio context, not text alone.
- Local deployment keeps sensitive audio data in-house.
- The architecture and math are documented in a public blog post.
Challenges:
- Half-duplex solely, with no turn-taking but.
- The large model needs a capable CUDA GPU.
- API access has been announced but is not yet available.
- Latency and quality claims still need third-party testing.
Key Takeaways
- Miso Labs open-sourced MisoTTS, an 8B text-to-speech model, under a modified MIT license.
- It conditions on both text and audio context, making generations responsive to the speaker’s tone.
- Residual vector quantization (32 codebooks × 2048-way) scales vocabulary to ~2048³² without adding parameters.
- The architecture splits into a 7.7B backbone (over time) and a 300M decoder (over depth).
- It is half-duplex and single-turn only today; API access is still pending.
The post Miso Labs Releases MisoTTS: An 8B Emotive Text-to-Speech Model with Open Weights appeared first on MarkTechPost.
