Best Text-to-Speech TTS Models in 2026: A Benchmark-Based Comparison
Text-to-speech TTS moved quick over the previous 12 months. The line between artificial and human speech narrowed. Latency dropped beneath 100 milliseconds for some real-time techniques. Emotional management grew to become a normal function somewhat than a analysis demo. This information critiques the fashions that basically matter in 2026. It is written for AI professionals selecting a mannequin for manufacturing.
How to learn TTS benchmarks in 2026
Two benchmarks dominate in most neighborhood discussions. The first is the Artificial Analysis Speech Arena Leaderboard. It ranks fashions by blind human choice utilizing an ELO score. As of 2026 it evaluates dozens of manufacturing APIs. The second is the community-run TTS Arena on Hugging Face. It makes use of the identical blind A/B voting technique.
These leaderboards measure perceived high quality, not accuracy. They additionally change repeatedly. As of May 30, 2026, the Artificial Analysis Speech Arena lists Gemini 3.1 Flash TTS, Realtime TTS-2 (Research Preview), Sonic 3.5, Realtime TTS 1.5 Max, and Fun-Realtime-TTS-Preview as its high 5 by ELO. Those positions shifted throughout the prior weeks, and they’ll shift once more. Treat any single quantity as a point-in-time studying, not a hard and fast reality.
Accuracy wants separate measurement. Trelis Research examined ten fashions utilizing a round-trip character error price, or CER. The technique transcribes generated audio with an ASR mannequin, then compares it to the enter textual content. Mean opinion rating, or MOS, captures perceived naturalness. Both metrics have limits. Round-trip CER depends upon the ASR mannequin’s personal accuracy. The UTMOS high quality estimator was educated on audio as much as ten seconds, so longer samples present much less rating unfold.
Latency is the third axis. The related determine for voice brokers is time-to-first-audio, or TTFA. Time-to-first-byte, or TTFB, may be deceptive, since container headers carry no audio. Consistency issues as a lot because the median. A Gradium benchmark from May 2026 measured the interquartile vary throughout suppliers. Tail latency, not the common, determines person expertise at scale.
In quick, no benchmark is full. Quality, accuracy, latency, language protection, and worth all commerce off. The proper mannequin depends upon which axis your utility can not compromise.
Commercial leaders
#1 Inworld TTS-1.5 and Realtime TTS-2
Inworld AI is a analysis lab based by a group from Google and DeepMind. It launched TTS-1.5 on January 21, 2026. The mannequin targets real-time, consumer-scale purposes. Inworld experiences roughly 30 % extra expressive vary than TTS-1. It additionally experiences about 40 % higher stability, measured by way of phrase error price and output consistency.
TTS-1.5 ships in two tiers. The Mini tier is tuned for latency-sensitive workloads comparable to voice brokers and gaming. The Max tier balances increased stability with low latency. Inworld experiences P90 time-to-first-audio beneath 130 milliseconds for Mini and beneath 250 milliseconds for Max. The mannequin helps 15 languages and gives each on the spot {and professional} voice cloning.
Pricing is tiered by plan, not a single price. On the On-Demand and Creator plans, Inworld lists $25 per million characters for TTS 1.5 Mini and $35 for Realtime TTS-2 and TTS 1.5 Max. The Developer and Growth plans reduce these charges; Growth reaches $15 for Mini and $25 for Max and TTS-2. Enterprise pricing goes as little as $5 and $10 respectively. Note that TTS 1.5 covers 15 languages, whereas TTS-2 covers over 100.
Inworld later added Realtime TTS-2 in 2026. It is described as a closed-loop voice mannequin with stronger steering and expressiveness. Across a number of leaderboard snapshots, Inworld reported holding three of the highest 5 spots on the Artificial Analysis Speech Arena.
Inworld fits builders constructing voice brokers at client scale. The mixture of low latency and aggressive pricing is its most important draw.
#2 Google Gemini 3.1 Flash TTS
Google DeepMind launched Gemini 3.1 Flash TTS on April 15, 2026. It is a preview mannequin out there by way of the Gemini API, Google AI Studio, Vertex AI, and Google Vids. The mannequin introduces greater than 200 audio tags. These tags steer type, tone, pacing, accent, and scene route.
On Google’s personal report, the mannequin reached an ELO of 1,211 on the Artificial Analysis leaderboard. It helps 70-plus languages and native multi-speaker dialogue. Google constructed it on the Gemini household somewhat than a standalone speech stack. The mannequin treats technology as a language activity: it decides not solely what to say, however easy methods to say it.
The mannequin has documented limitations that matter for deployment. A TTS session has a 32,000-token context window, and Google’s docs state that Gemini TTS doesn’t help streaming. It is constructed for managed textual content recitation, not interactive voice brokers; the separate Live API is Google’s real-time path. Output high quality can drift on generations longer than a couple of minutes, so Google recommends chunking. The mannequin gives 30 prebuilt voices. All generated audio carries a SynthID watermark for AI-content identification.
Gemini 3.1 Flash TTS suits podcast and audiobook technology with fine-grained management. It is a robust default for groups already on Google Cloud.
#3 ElevenLabs v3
ElevenLabs launched Eleven v3 in alpha on June 5, 2025. It reached basic availability in early 2026, per the corporate’s announcement. ElevenLabs describes it as its most expressive mannequin. It launched inline audio tags formatted in lowercase sq. brackets. Examples embrace [whispers], [laughs], [sighs], and scene cues like [interrupting]. The mannequin helps greater than 70 languages.
The GA launch refined the alpha. ElevenLabs experiences customers most well-liked the brand new model about 72 % of the time. It additionally improved how the mannequin handles numbers, symbols, and specialised notation.
A key function is Text to Dialogue. It weaves a number of voices into one technology go. The mannequin matches prosody and emotional vary throughout audio system. It can deal with interruptions and shifting moods with restricted prompting.
Eleven v3 nonetheless requires extra immediate engineering than earlier fashions. It will not be constructed for real-time use. ElevenLabs states the bigger mannequin and higher-fidelity codec take longer to run. For real-time and conversational use, the corporate recommends Flash v2.5 as a substitute. Those fashions stream with low latency, across the 75-millisecond vary in vendor figures.
ElevenLabs v3 suits narrative content material, audiobooks, and character work the place high quality outweighs velocity. It stays a typical start line for high-quality voice manufacturing.
#4 MiniMax Speech 2.6 HD and later
MiniMax constructed a aggressive line of speech fashions with restricted consideration in English-speaking markets. Speech 2.6 HD gives sturdy expressiveness and help for 40-plus languages. It sits excessive on a number of leaderboard snapshots. One January 2026 studying positioned Speech 2.6 HD close to the highest on Artificial Analysis.
The Turbo variant targets brokers, maintaining latency beneath 250 milliseconds. MiniMax’s enchantment is its price-to-performance ratio. It delivers emotion management that competes with costlier flagships. Later HD variations, comparable to Speech 2.8 HD, seem in 2026 leaderboard snapshots at premium pricing.
MiniMax suits multilingual purposes that want expressiveness with out flagship pricing.
#5 Hume Octave 2
Hume AI takes a distinct design strategy. Octave 2 is a speech-language mannequin that reads for that means earlier than producing audio. It produces emotionally calibrated speech somewhat than making use of mounted pronunciation guidelines. The mannequin shifts supply by itself as a script strikes from calm to pressing. It does this with out express tags or directions.
The trade-offs are actual. Language protection is slim in comparison with multilingual flagships. Building cloned voices right into a manufacturing API requires a gross sales course of. Reported pricing varies extensively by supply and tier, from beneath $10 to over $100 per million characters. Confirm the present price with Hume earlier than budgeting.
Octave 2 suits purposes the place tone carries weight. Examples embrace companion brokers, mental-health instruments, and buyer interactions the place flat supply breaks the expertise.
#6 Cartesia Sonic 3 and Sonic 3.5
Cartesia optimizes for velocity. Sonic makes use of a State Space Model, or SSM, structure as a substitute of transformers. SSM inference scales linearly somewhat than quadratically with sequence size. This retains latency low beneath load. Cartesia experiences mannequin latency beneath 100 milliseconds, and an end-to-end time-to-first-audio close to 82 milliseconds on Sonic 3.5.
Sonic 3 was launched in late 2025. Sonic 3.5 adopted in May 2026 and is now the really helpful steady mannequin. Both help 42 languages, together with 9 Indian languages, with greater than 500 voices. Cartesia briefly held the number-one spot on the Artificial Analysis leaderboard with Sonic 3.5 earlier than others overtook it. The fashions add refined prosody, wider emotional vary, real-time laughter, and voice cloning from quick samples.
Sonic 3 suits real-time conversational brokers the place latency is the arduous constraint. It is a TTS-only system, so groups carry their very own speech-to-text and language mannequin.
#7 Speechify SIMBA 3.0
Speechify positions SIMBA 3.0 as a cost-efficient flagship. The firm reported a number-seven rank on the Artificial Analysis leaderboard in May 2026. Its reported ELO was about 1,159, at a listing worth close to $10 per million characters. That made it the lowest-priced mannequin in the reported high ten.
These figures come from Speechify’s personal announcement, so confirm them independently earlier than committing. SIMBA 3.0 suits groups searching for benchmark-competitive high quality at decrease value than premium flagships.
#8 OpenAI gpt-4o-mini-tts and the Realtime line
OpenAI introduced gpt-4o-mini-tts in March 2025. It is constructed on the GPT-4o-mini structure. Its most important function is steerability by way of natural-language directions. Developers can instruct the mannequin on easy methods to say one thing, not simply what. An instance instruction is “communicate in a peaceful, empathetic tone.” OpenAI additionally launched a playground for testing at OpenAI.fm.
OpenAI shipped an up to date snapshot, gpt-4o-mini-tts-2025-12-15, in December 2025. It experiences roughly 35 % decrease phrase error price on the Common Voice and FLEURS benchmarks. The replace additionally improved Custom Voices, which let organizations construct a branded voice from a reference pattern. The endpoint exposes 13 built-in voices and covers 50-plus languages. OpenAI costs it at $0.60 per million textual content enter tokens and $12 per million audio output tokens, which works out to roughly $0.015 per minute of audio. OpenAI calls it its latest and most dependable TTS mannequin; the older tts-1 and tts-1-hd stay out there.
For conversational brokers, OpenAI’s Realtime line superior additional. The Realtime API reached basic availability in August 2025. In May 2026, OpenAI launched GPT-Realtime-2, its first voice mannequin with GPT-5-class reasoning. It handles software calls, interruptions, and corrections throughout dwell speech-to-speech. OpenAI additionally added GPT-Realtime-Translate and GPT-Realtime-Whisper for dwell translation and transcription.
gpt-4o-mini-tts suits groups already on the OpenAI platform that want low-cost, instructable speech. The Realtime fashions swimsuit full speech-to-speech brokers.
Open-weight fashions
As of late May 2026, the general high tier of the Artificial Analysis leaderboard remained closed-source. Open weights nonetheless matter. They permit self-hosting, customization, on-device deployment, and management over information. They can take away per-character API prices, changed by your individual compute. But licenses differ. Some weights are permissive, whereas others are research-only and require a separate license for business use. Check the license earlier than constructing on any of them.
#01 Kokoro 82M
Kokoro is among the most effective open-weight fashions out there. It now not leads the open-weight rankings; on the present Artificial Analysis leaderboard it sits round an ELO of 1,058, behind Fish Audio S2 Pro, Step Audio EditX, and Voxtral TTS. It has simply 82 million parameters. The structure builds on StyleTTS2 and ISTFTNet. It avoids diffusion and encoder levels, which speeds technology.
In the Trelis “Tricky TTS” take a look at, Kokoro reached a 4.5 MOS and a 17 % CER. That was the best high quality rating among the many fashions examined there. It runs effectively on modest {hardware}, together with CPU. Hosted API charges run beneath $1 per million characters of enter, round $0.65 in one present itemizing. Its weights have been launched in late December 2024, with v1.0 following in 2025. It covers about 15 languages and is distributed beneath the Apache 2.0 license.
Kokoro suits cost-sensitive or edge deployments the place compact dimension and velocity matter. Emotion-markup and cross-lingual options stay experimental and are greatest supported in English.
#02 Fish Audio S2 Pro
Fish Audio S2 Pro is the highest-ranked open-weight mannequin on the present Artificial Analysis leaderboard, at an ELO close to 1,123. Fish Audio experiences coaching on greater than 10 million hours of audio throughout 80-plus languages. The 5-billion-parameter mannequin makes use of a Dual-Autoregressive structure with an RVQ audio codec. It helps open-domain emotion tags, native multi-speaker output, and latency beneath 150 milliseconds.
There is a vital license caveat. S2 Pro ships beneath the Fish Audio Research License, not a permissive open license. Research and non-commercial use are free. Commercial use requires a separate license from Fish Audio. The weights, fine-tuning code, and a streaming inference engine are all revealed. Self-hosting nonetheless wants actual GPU sources.
Fish Audio suits groups that need high open-weight high quality, supplied they safe a business license earlier than delivery.
#03 IndexTTS-2
IndexTTS-2, from IndexTeam, advances zero-shot TTS. Its standout function is exact length management. That makes it helpful for video dubbing, the place audio should match a hard and fast time window. The mannequin additionally separates timbre from emotion. Developers can management voice identification and emotional tone independently.
The structure incorporates GPT latent representations and a three-stage coaching course of. A tender instruction mechanism, constructed by fine-tuning Qwen3, guides emotional tone by way of textual content descriptions. Its authors report that IndexTTS-2 beats prior zero-shot techniques on phrase error price, speaker similarity, and emotional constancy throughout a number of datasets.
IndexTTS-2 suits skilled dubbing and expressive synthesis the place timing and management are important. Its dual-mode operation provides configuration complexity.
#04 CosyVoice 2
CosyVoice2-0.5B comes from the FunAudioLLM challenge. It has 0.5 billion parameters. Its focus is ultra-low-latency streaming synthesis. It helps zero-shot voice cloning. The small footprint makes it sensible for real-time, self-hosted pipelines.
CosyVoice 2 suits real-time purposes the place groups need an open streaming mannequin.
#05 VibeVoice
VibeVoice, from Microsoft, targets long-form technology. The 1.5-billion-parameter mannequin helps context lengths as much as 64,000 tokens. It can produce roughly 90 minutes of steady speech. That fits podcasts and lengthy narration.
It has clear constraints. It is educated on English and Chinese solely. It generates multi-speaker audio sequentially, with no overlapping speech. VibeVoice suits long-form, two-language initiatives that want prolonged continuity.
Other notable present fashions
The subject is wider than the simply the rating record. Several fashions seem on present leaderboards and deserve a spot on a shortlist. xAI shipped its personal Text to Speech mannequin in 2026. StepAudio 2.5 TTS seems amongst premium-priced high entries. Voxtral TTS, a 4-billion-parameter mannequin from Mistral introduced in March 2026, makes use of character-based pricing close to $0.016 per 1,000 characters. Step Audio EditX and Magpie-Multilingual rank among the many stronger open-weight choices. Alibaba’s Qwen3-TTS and Maya1 add additional open and multilingual decisions. None of those is a default, however every can win a selected temporary.
Choosing a mannequin by use case
The market is now not a single-winner race. Start with the job, then decide the software.
Real-time voice brokers: Latency is the binding constraint. Users won’t wait. Cartesia Sonic 3.5 leads on uncooked velocity with its SSM structure, close to 82 milliseconds end-to-end. Inworld’s realtime tiers pair low latency with low value. Deepgram Aura-2 is one other low-latency choice, reported beneath 90 milliseconds. ElevenLabs Flash v2.5 retains the identical voice library as offline workloads. For full speech-to-speech, think about OpenAI’s GPT-Realtime-2.
Long-form audiobooks and narration: Quality dominates and latency is irrelevant. ElevenLabs v3 units a excessive realism bar for narrative content material. Gemini 3.1 Flash TTS gives sturdy management, with chunking for lengthy scripts. Among open weights, VibeVoice handles prolonged continuity in English and Chinese.
Multilingual content material: Coverage and consistency matter most. Gemini 3.1 Flash TTS and ElevenLabs v3 each help 70-plus languages. MiniMax Speech covers 40-plus at decrease value. Fish Audio S2 Pro leads the open tier with 80-plus languages, however business use wants a paid license.
Character and dialogue work: Expressiveness and multi-speaker management lead. ElevenLabs v3 Text to Dialogue handles interruptions and overlapping turns. Gemini 3.1 Flash TTS provides scene route and per-speaker management. Inworld targets recreation characters particularly.
Emotional constancy: Hume Octave 2 reads for that means and adapts supply with out tags. It suits companion brokers and delicate interactions.
On-device and price management: Open weights take away API charges. Kokoro runs on CPU with a small footprint. CosyVoice 2 streams at low latency. Both commerce some high quality for management.
Dubbing: IndexTTS-2 gives length management to match audio to video timing. That functionality is uncommon amongst general-purpose fashions.
Marktechpost’s Visual Explainer
01 / 11
Key Takeaways
- No single mannequin wins; decide by your binding constraint — latency, high quality, language protection, or value.
- Current leaderboard high tier: Gemini 3.1 Flash TTS, Inworld Realtime TTS-2, Cartesia Sonic 3.5, ElevenLabs v3.
- Rankings shift weekly, so deal with any ELO snapshot as dated, not mounted.
- Cartesia Sonic 3.5 owns real-time latency at ~82ms end-to-end; Deepgram Aura-2 is a detailed second.
- ElevenLabs v3 went typically out there in early 2026 and leads expressive, multi-speaker narration.
- Gemini 3.1 Flash TTS has no streaming and a 32k-token restrict — it is recitation, not a dwell agent.
- Fish Audio S2 Pro is the highest open-weight mannequin however research-licensed; business use wants a paid license.
- Kokoro is essentially the most environment friendly open choice, however now not the highest-ranked open weight.
- Inworld pricing is tiered: $25/$35 on-demand, dropping to $5/$10 at enterprise quantity.
- Public benchmarks slim the sector; your individual take a look at by yourself textual content makes the decision.
Sources:
Benchmarks & leaderboards
Commercial fashions (official sources)
- Inworld — Realtime TTS · Inworld — Pricing
- Google — Gemini 3.1 Flash TTS announcement · Google Cloud — Gemini 3.1 Flash TTS · Gemini API — Speech generation docs
- ElevenLabs — Eleven v3 · ElevenLabs — v3 General Availability
- MiniMax — Speech 02 news
- Hume — Octave · Hume — Octave launch blog
- Cartesia — Sonic
- Speechify
- OpenAI — Text-to-speech docs · OpenAI — Next-generation audio models · OpenAI — Advancing voice intelligence (GPT-Realtime-2)
- Deepgram — Aura-2 text-to-speech
Open-weight fashions (mannequin playing cards & official pages)
- Fish Audio — S2 · Fish Audio S2 Pro (Hugging Face)
- Kokoro-82M (Hugging Face)
- IndexTTS-2 (Hugging Face)
- CosyVoice2-0.5B (Hugging Face)
- VibeVoice-1.5B (Hugging Face)
- Qwen3-TTS (Hugging Face)
- Maya1 (Hugging Face)
- Voxtral / Mistral · xAI
Also, be happy to comply with us on Twitter and don’t neglect to hitch our 150k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.
Need to accomplice with us for selling your GitHub Repo OR Hugging Face Page OR Product Release OR Webinar and so forth.? Connect with us
The publish Best Text-to-Speech TTS Models in 2026: A Benchmark-Based Comparison appeared first on MarkTechPost.
