Supertone Releases Supertonic v3: On-Device Text-to-Speech Model with 31-Language Support, Fewer Reading Failures, and Expression Tags
Supertone launched Supertonic 3, the third generation of its on-device, ONNX-based text-to-speech system. Supertonic 3 ships with 31-language support, improved reading accuracy, fewer repeat and skip failures, and v2-compatible public ONNX assets. It is a lightning-fast, on-device, multilingual, and accurate TTS system.
What Changed from v2 to v3
Compared with Supertonic 2, Supertonic 3 reduces repeat and skip failures, improves speaker similarity across the shared-language set, and expands language coverage from 5 to 31 languages. Version 2 supported English, Korean, Spanish, Portuguese, and French. Version 3 adds Japanese, Arabic, Bulgarian, Czech, Danish, German, Greek, Estonian, Finnish, Croatian, Hungarian, Indonesian, Italian, Lithuanian, Latvian, Dutch, Polish, Romanian, Russian, Slovak, Slovenian, Swedish, Turkish, Ukrainian, and Vietnamese, for 31 total ISO language codes. There is also a special na fallback for text whose language is unknown or outside the supported set.
The model grows modestly to accommodate the added languages. At about 99M parameters across the public ONNX assets, Supertonic 3 is far smaller than 0.7B-to-2B-class open TTS systems. The smaller model size is a practical advantage for download size, startup time, and on-device inference. The update also brings the total disk footprint of the public ONNX assets to 404 MB. Additionally, Supertone recently launched Voice Builder, allowing developers to create custom, edge-native TTS models from their own voice recordings.
Expressive Tags
One new capability in v3 that wasn't present in v2 is expressive tag support. Supertonic 3 supports simple expression tags such as <chuckle>, <breath>, and <sigh>. These let you embed prosodic cues directly into input text with no separate preprocessing step or a separate model for expressiveness. For engineers building voice interfaces or accessibility tools, this means you can specify breathing pauses or laughter inline in your text payload.
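Because the tags live inline in the text payload, they can be handled like ordinary markup on the application side. A minimal sketch (plain Python, not part of the Supertonic SDK; `split_tags` is a hypothetical helper) that validates and strips the tag set named above, e.g. when falling back to an engine without tag support:

```python
import re

# Expression tags named in the article; anything else in angle brackets
# is treated as literal text rather than a prosodic cue.
KNOWN_TAGS = {"chuckle", "breath", "sigh"}
TAG_RE = re.compile(r"<(\w+)>")

def split_tags(text):
    """Return (clean_text, tags_found) for a tagged input string."""
    tags = [m for m in TAG_RE.findall(text) if m in KNOWN_TAGS]
    clean = TAG_RE.sub(lambda m: "" if m.group(1) in KNOWN_TAGS else m.group(0), text)
    return re.sub(r"\s{2,}", " ", clean).strip(), tags

clean, tags = split_tags("Well <chuckle> that went better than expected. <sigh>")
print(clean)  # Well that went better than expected.
print(tags)   # ['chuckle', 'sigh']
```

When the target engine does support the tags, the original string is passed through unchanged.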
Architecture and Runtime
The underlying architecture carries over from prior versions: a speech autoencoder that encodes waveforms into continuous latent representations, a flow-matching-based text-to-latent module that maps text to audio features, and a duration predictor that controls natural timing. Flow matching is a generative modeling technique that learns a vector field to transform a simple distribution into a target distribution; it samples faster than diffusion models at low step counts, which is why Supertonic can produce usable output in just 2 inference steps. To further refine output, v3 integrates Length-Aware Rotary Position Embedding (LARoPE) for better text-speech alignment and uses a Self-Purifying Flow Matching technique during training to stay robust against noisy data labels.
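To see why a handful of steps can be enough, note that flow-matching inference is just numerical integration of the learned vector field from t=0 to t=1. A toy NumPy sketch (illustrative only; the real model's field is a neural network over speech latents, while the closed-form field below makes the trajectories exactly straight):

```python
import numpy as np

def velocity_field(x, t, x_target):
    """Closed-form field whose flow carries any point to x_target at t=1.
    A trained flow-matching model approximates a field like this with a network."""
    return (x_target - x) / (1.0 - t)

def euler_sample(x0, x_target, steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with forward Euler."""
    x, t = x0.copy(), 0.0
    dt = 1.0 / steps
    for _ in range(steps):
        x = x + dt * velocity_field(x, t, x_target)
        t += dt
    return x

rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)                  # sample from the simple base distribution
x_target = np.array([1.0, -2.0, 0.5, 3.0])   # stand-in for a speech latent
out = euler_sample(x0, x_target, steps=2)
print(np.allclose(out, x_target))            # straight trajectories: True
```

When the learned field produces nearly straight trajectories, coarse 2-step Euler integration lands close to the target, whereas diffusion samplers typically need many more steps for comparable quality.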
On runtime efficiency, Supertonic 3 runs fast on CPU, even compared with larger baselines measured on an A100 GPU, and uses significantly less memory. It doesn't require a GPU, which makes local, browser, and edge deployment much easier.
Reading Accuracy
Across measured languages, Supertonic 3 stays within a competitive WER/CER range against much larger open TTS models such as VoxCPM2, while preserving a lightweight on-device deployment path. WER (Word Error Rate) and CER (Character Error Rate) are standard TTS readability metrics: you synthesize a passage, run ASR over the output, and compare the transcription to the original text. CER is used for languages without clear word boundaries; the others use WER. The system's efficiency is best demonstrated on extreme edge hardware: it achieves an average RTF of 0.3x on an Onyx Boox Go 6 (an E-ink e-reader) in airplane mode. Furthermore, the ecosystem has expanded to include Flutter (with macOS support), .NET 9, and Go, while the web implementation leverages onnxruntime-web for pure client-side execution.
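The WER computation described above is a word-level Levenshtein distance normalized by reference length. A minimal, generic sketch (not tied to any Supertonic tooling):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / len(ref)

# Compare the ASR transcript of synthesized audio against the original passage
print(wer("the quick brown fox", "the quik brown fox"))  # 0.25
```

CER is the same computation applied to characters instead of words, which is why it suits languages without clear word boundaries.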
Text Normalization
A differentiating property carried forward from v2 is built-in text normalization. Supertonic handles complex surface forms, such as financial expressions like $5.2M, phone numbers with area codes and extensions like (212) 555-0142 ext. 402, time and date formats like 4:45 PM on Wed, Apr 3, 2024, and technical units like 2.3h and 30kph, without any preprocessing pipeline or phonetic annotations. The financial expression "$5.2M" should read as "five point two million dollars," and "$450K" as "four hundred fifty thousand dollars." All four competing systems failed this. The technical unit "2.3h" should read as "two point three hours" and "30kph" as "thirty kilometers per hour." All four competitors also failed this category. The competing systems evaluated include ElevenLabs Flash v2.5, OpenAI TTS-1, Gemini 2.5 Flash TTS, and Microsoft.
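To illustrate the kind of expansion involved (this is not Supertonic's internal normalizer, just a toy rule set covering the exact surface forms quoted above, with a deliberately tiny number table):

```python
import re

# Tiny spelled-out number table; a real normalizer covers the full numeric range.
WORDS = {"2": "two", "3": "three", "5": "five", "30": "thirty", "450": "four hundred fifty"}

def spell(number: str) -> str:
    """Spell a number like '5.2' as 'five point two', digit by digit after the point."""
    whole, _, frac = number.partition(".")
    out = WORDS.get(whole, whole)
    if frac:
        out += " point " + " ".join(WORDS.get(d, d) for d in frac)
    return out

def normalize(text: str) -> str:
    # Financial: $5.2M -> "five point two million dollars"
    text = re.sub(r"\$([\d.]+)M", lambda m: spell(m.group(1)) + " million dollars", text)
    text = re.sub(r"\$([\d.]+)K", lambda m: spell(m.group(1)) + " thousand dollars", text)
    # Technical units: 30kph -> "thirty kilometers per hour", 2.3h -> "two point three hours"
    text = re.sub(r"([\d.]+)kph\b", lambda m: spell(m.group(1)) + " kilometers per hour", text)
    text = re.sub(r"([\d.]+)h\b", lambda m: spell(m.group(1)) + " hours", text)
    return text

print(normalize("The trip takes 2.3h at 30kph"))
print(normalize("Revenue hit $5.2M"))
```

The practical point is that with built-in normalization, none of this rule maintenance falls on the application developer.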

Getting Started
The Python SDK installs with pip install supertonic. On first run, the SDK downloads the model assets from Hugging Face automatically. A minimal example:
from supertonic import TTS

tts = TTS(auto_download=True)
style = tts.get_voice_style(voice_name="M1")
text = "A gentle breeze moved through the open window while everyone listened to the story."
wav, duration = tts.synthesize(text, voice_style=style, lang="en")
tts.save_audio(wav, "output.wav")
print(f"Generated {duration:.2f}s of audio")
Key Takeaways
- Supertonic 3 expands language support from 5 (v2) to 31 languages, growing from 66M to ~99M parameters with a total ONNX asset size of 404 MB
- New in v3: expressive tags (<chuckle>, <breath>, <sigh>), more stable reading on short and long utterances, and improved speaker similarity vs. v2
- v2-compatible public ONNX interface: existing integrations upgrade without changing inference code
- Reading accuracy benchmarked against VoxCPM2; v3 stays within a competitive WER/CER range while being significantly smaller
- v3-specific RTF/throughput numbers have not been published; the 167× faster-than-real-time figure is a v2 benchmark and should not be assumed identical for v3
- Native output of 16-bit WAV files, ensuring high-fidelity audio for engineering applications
The post Supertone Releases Supertonic v3: On-Device Text-to-Speech Model with 31-Language Support, Fewer Reading Failures, and Expression Tags appeared first on MarkTechPost.