Supertone Releases Supertonic v3: On-Device Text-to-Speech Model with 31-Language Support, Fewer Reading Failures, and Expression Tags
Supertone launched Supertonic 3, the third generation of its on-device, ONNX-based text-to-speech system. Supertonic 3 ships with 31-language support, improved reading accuracy, fewer repeat and skip failures, and v2-compatible public ONNX assets. It is a lightning-fast, on-device, multilingual, and accurate TTS system.
What Changed from v2 to v3
Compared with Supertonic 2, Supertonic 3 reduces repeat and skip failures, improves speaker similarity across the shared-language set, and expands language coverage from 5 to 31 languages. Version 2 supported English, Korean, Spanish, Portuguese, and French. Version 3 adds Japanese, Arabic, Bulgarian, Czech, Danish, German, Greek, Estonian, Finnish, Croatian, Hungarian, Indonesian, Italian, Lithuanian, Latvian, Dutch, Polish, Romanian, Russian, Slovak, Slovenian, Swedish, Turkish, Ukrainian, and Vietnamese, for 31 total ISO language codes. There is also a special na fallback for text whose language is unknown or outside the supported set.
The model grows modestly to accommodate the added languages. At about 99M parameters across the public ONNX assets, Supertonic 3 is far smaller than 0.7B-to-2B-class open TTS systems. The smaller model size is a practical advantage for download size, startup time, and on-device inference. The update also brings the total disk footprint of the public ONNX assets to 404 MB. Additionally, Supertone recently launched Voice Builder, allowing developers to create custom, edge-native TTS models from their own voice recordings.
Expressive Tags
One new capability in v3 that wasn't present in v2 is expressive tag support. Supertonic 3 supports simple expression tags such as <chuckle>, <breath>, and <sigh>. These let you embed prosodic cues directly into input text with no separate preprocessing step or a separate model for expressiveness. For engineers building voice interfaces or accessibility tools, this means you can specify breathing pauses or laughter inline in your text payload.
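Because the tags live inline in the text payload, they can be handled like ordinary markup on the application side. A minimal sketch (plain Python, not part of the Supertonic SDK; `split_tags` is a hypothetical helper) that validates and strips the tag set named above, e.g. when falling back to an engine without tag support:

```python
import re

# Expression tags named in the article; anything else in angle brackets
# is treated as literal text rather than a prosodic cue.
KNOWN_TAGS = {"chuckle", "breath", "sigh"}
TAG_RE = re.compile(r"<(\w+)>")

def split_tags(text):
    """Return (clean_text, tags_found) for a tagged input string."""
    tags = [m for m in TAG_RE.findall(text) if m in KNOWN_TAGS]
    clean = TAG_RE.sub(lambda m: "" if m.group(1) in KNOWN_TAGS else m.group(0), text)
    return re.sub(r"\s{2,}", " ", clean).strip(), tags

clean, tags = split_tags("Well <chuckle> that went better than expected. <sigh>")
print(clean)  # Well that went better than expected.
print(tags)   # ['chuckle', 'sigh']
```

When the target engine does support the tags, the original string is passed through unchanged.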
Architecture and Runtime
The underlying architecture carries over from prior versions: a speech autoencoder that encodes waveforms into continuous latent representations, a flow-matching-based text-to-latent module that maps text to audio features, and a duration predictor that controls natural timing. Flow matching is a generative modeling technique that learns a vector field to transform a simple distribution into a target distribution; it samples faster than diffusion models at low step counts, which is why Supertonic can produce usable output in just 2 inference steps. To further refine output, v3 integrates Length-Aware Rotary Position Embedding (LARoPE) for better text-speech alignment and uses a Self-Purifying Flow Matching technique during training to stay robust against noisy data labels.
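To see why a handful of steps can be enough, note that flow-matching inference is just numerical integration of the learned vector field from t=0 to t=1. A toy NumPy sketch (illustrative only; the real model's field is a neural network over speech latents, while the closed-form field below makes the trajectories exactly straight):

```python
import numpy as np

def velocity_field(x, t, x_target):
    """Closed-form field whose flow carries any point to x_target at t=1.
    A trained flow-matching model approximates a field like this with a network."""
    return (x_target - x) / (1.0 - t)

def euler_sample(x0, x_target, steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with forward Euler."""
    x, t = x0.copy(), 0.0
    dt = 1.0 / steps
    for _ in range(steps):
        x = x + dt * velocity_field(x, t, x_target)
        t += dt
    return x

rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)                  # sample from the simple base distribution
x_target = np.array([1.0, -2.0, 0.5, 3.0])   # stand-in for a speech latent
out = euler_sample(x0, x_target, steps=2)
print(np.allclose(out, x_target))            # straight trajectories: True
```

When the learned field produces nearly straight trajectories, coarse 2-step Euler integration lands close to the target, whereas diffusion samplers typically need many more steps for comparable quality.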
On runtime efficiency, Supertonic 3 runs fast on CPU, even compared with larger baselines measured on an A100 GPU, and uses significantly less memory. It doesn't require a GPU, which makes local, browser, and edge deployment much easier.
Reading Accuracy
Across measured languages, Supertonic 3 stays within a competitive WER/CER range against much larger open TTS models such as VoxCPM2, while preserving a lightweight on-device deployment path. WER (Word Error Rate) and CER (Character Error Rate) are standard TTS readability metrics: you synthesize a passage, run ASR over the output, and compare the transcription to the original text. CER is used for languages without clear word boundaries; the others use WER. The system's efficiency is best demonstrated on extreme edge hardware: it achieves an average RTF of 0.3x on an Onyx Boox Go 6 (an E-ink e-reader) in airplane mode. Furthermore, the ecosystem has expanded to include Flutter (with macOS support), .NET 9, and Go, while the web implementation leverages onnxruntime-web for pure client-side execution.
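The WER computation described above is a word-level Levenshtein distance normalized by reference length. A minimal, generic sketch (not tied to any Supertonic tooling):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / len(ref)

# Compare the ASR transcript of synthesized audio against the original passage
print(wer("the quick brown fox", "the quik brown fox"))  # 0.25
```

CER is the same computation applied to characters instead of words, which is why it suits languages without clear word boundaries.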
Text Normalization
A differentiating property carried forward from v2 is built-in text normalization. Supertonic handles complex surface forms, such as financial expressions like $5.2M, phone numbers with area codes and extensions like (212) 555-0142 ext. 402, time and date formats like 4:45 PM on Wed, Apr 3, 2024, and technical units like 2.3h and 30kph, without any preprocessing pipeline or phonetic annotations. The financial expression "$5.2M" should read as "five point two million dollars," and "$450K" as "four hundred fifty thousand dollars." All four competing systems failed this. The technical unit "2.3h" should read as "two point three hours" and "30kph" as "thirty kilometers per hour." All four competitors also failed this category. The competing systems evaluated include ElevenLabs Flash v2.5, OpenAI TTS-1, Gemini 2.5 Flash TTS, and Microsoft.
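To illustrate the kind of expansion involved (this is not Supertonic's internal normalizer, just a toy rule set covering the exact surface forms quoted above, with a deliberately tiny number table):

```python
import re

# Tiny spelled-out number table; a real normalizer covers the full numeric range.
WORDS = {"2": "two", "3": "three", "5": "five", "30": "thirty", "450": "four hundred fifty"}

def spell(number: str) -> str:
    """Spell a number like '5.2' as 'five point two', digit by digit after the point."""
    whole, _, frac = number.partition(".")
    out = WORDS.get(whole, whole)
    if frac:
        out += " point " + " ".join(WORDS.get(d, d) for d in frac)
    return out

def normalize(text: str) -> str:
    # Financial: $5.2M -> "five point two million dollars"
    text = re.sub(r"\$([\d.]+)M", lambda m: spell(m.group(1)) + " million dollars", text)
    text = re.sub(r"\$([\d.]+)K", lambda m: spell(m.group(1)) + " thousand dollars", text)
    # Technical units: 30kph -> "thirty kilometers per hour", 2.3h -> "two point three hours"
    text = re.sub(r"([\d.]+)kph\b", lambda m: spell(m.group(1)) + " kilometers per hour", text)
    text = re.sub(r"([\d.]+)h\b", lambda m: spell(m.group(1)) + " hours", text)
    return text

print(normalize("The trip takes 2.3h at 30kph"))
print(normalize("Revenue hit $5.2M"))
```

The practical point is that with built-in normalization, none of this rule maintenance falls on the application developer.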

Getting Started
The Python SDK installs with pip install supertonic. On first run, the SDK downloads the model assets from Hugging Face automatically. A minimal example:
from supertonic import TTS

tts = TTS(auto_download=True)
style = tts.get_voice_style(voice_name="M1")
text = "A gentle breeze moved through the open window while everyone listened to the story."
wav, duration = tts.synthesize(text, voice_style=style, lang="en")
tts.save_audio(wav, "output.wav")
print(f"Generated {duration:.2f}s of audio")
Key Takeaways
- Supertonic 3 expands language support from 5 (v2) to 31 languages, growing from 66M to ~99M parameters with a total ONNX asset size of 404 MB
- New in v3: expressive tags (<chuckle>, <breath>, <sigh>), more stable reading on short and long utterances, and improved speaker similarity vs. v2
- v2-compatible public ONNX interface: existing integrations upgrade without changing inference code
- Reading accuracy benchmarked against VoxCPM2; v3 stays within a competitive WER/CER range while being significantly smaller
- v3-specific RTF/throughput numbers have not been published; the 167× faster-than-real-time figure is a v2 benchmark and should not be assumed identical for v3
- Native output of 16-bit WAV files, ensuring high-fidelity audio for engineering applications
The post Supertone Releases Supertonic v3: On-Device Text-to-Speech Model with 31-Language Support, Fewer Reading Failures, and Expression Tags appeared first on MarkTechPost.