Neuphonic Open-Sources NeuTTS Air: A 748M-Parameter On-Device Speech Language Model with Instant Voice Cloning
Neuphonic has released NeuTTS Air, an open-source text-to-speech (TTS) speech language model designed to run locally in real time on CPUs. The Hugging Face model card lists 748M parameters (Qwen2 architecture), and the model ships in GGUF quantizations (Q4/Q8), enabling inference via llama.cpp/llama-cpp-python without cloud dependencies. It is licensed under Apache-2.0 and includes a runnable demo and examples.
So, what’s new?
NeuTTS Air couples a 0.5B-class Qwen backbone with Neuphonic’s NeuCodec audio codec. Neuphonic positions the system as a “super-realistic, on-device” TTS LM that clones a voice from ~3 seconds of reference audio and synthesizes speech in that style, targeting voice agents and privacy-sensitive applications. The model card and repository explicitly emphasize real-time CPU generation and small-footprint deployment.
Key Features
- Realism at sub-1B scale: Human-like prosody and timbre preservation for a ~0.7B (Qwen2-class) text-to-speech LM.
- On-device deployment: Distributed in GGUF (Q4/Q8) with CPU-first paths; suitable for laptops, phones, and Raspberry Pi-class boards.
- Instant speaker cloning: Style transfer from ~3 seconds of reference audio (reference WAV + transcript).
- Compact LM+codec stack: Qwen 0.5B backbone paired with NeuCodec (0.8 kbps / 24 kHz) to balance latency, footprint, and output quality.
Model architecture and runtime path
- Backbone: Qwen 0.5B used as a lightweight LM to condition speech generation; the hosted artifact is reported as 748M params under the qwen2 architecture on Hugging Face.
- Codec: NeuCodec provides low-bitrate acoustic tokenization/decoding; it targets 0.8 kbps with 24 kHz output, enabling compact representations for efficient on-device use.
- Quantization & format: Prebuilt GGUF backbones (Q4/Q8) are available; the repo includes instructions for llama-cpp-python and an optional ONNX decoder path.
- Dependencies: Uses espeak for phonemization; examples and a Jupyter notebook are provided for end-to-end synthesis.
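To put the quoted figures in perspective, here is a rough back-of-envelope sketch in Python. It uses only the numbers cited above (748M parameters, 0.8 kbps, 24 kHz). The nominal 4-bit and 8-bit weight sizes and the 16-bit mono PCM baseline are assumptions for illustration; real GGUF files also carry quantization scales and metadata, so actual file sizes will be somewhat larger.

```python
# Back-of-envelope sizing. Inputs are figures quoted in the article or
# labeled assumptions; nothing here is a measurement.

PARAMS = 748_000_000      # parameter count reported on the model card
CODEC_BPS = 800           # NeuCodec target bitrate: 0.8 kbps
SAMPLE_RATE_HZ = 24_000   # codec output sample rate
PCM_BITS = 16             # assumption: 16-bit mono PCM as the raw baseline


def weight_bytes(params: int, bits_per_weight: float) -> float:
    """Nominal weight storage, ignoring quantization scales and metadata."""
    return params * bits_per_weight / 8


def codec_bytes(seconds: float, bps: int = CODEC_BPS) -> float:
    """Bytes of codec tokens needed to represent `seconds` of speech."""
    return seconds * bps / 8


q4_mb = weight_bytes(PARAMS, 4) / 1e6
q8_mb = weight_bytes(PARAMS, 8) / 1e6
pcm_bps = SAMPLE_RATE_HZ * PCM_BITS          # raw PCM bitrate in bits/s
compression_ratio = pcm_bps / CODEC_BPS

print(f"Q4 nominal: {q4_mb:.0f} MB, Q8 nominal: {q8_mb:.0f} MB")
print(f"10 s of speech: {codec_bytes(10):.0f} bytes of codec tokens")
print(f"Raw PCM is {compression_ratio:.0f}x the codec bitrate")
```

Under these assumptions, the weights land in the hundreds of megabytes and the codec stream is a few hundred times smaller than raw PCM, which is consistent with the small-footprint positioning described above.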
On-device performance focus
NeuTTS Air showcases ‘real-time generation on mid-range devices’ and offers CPU-first defaults; GGUF quantization is intended for laptops and single-board computers. While no fps or real-time factor (RTF) numbers are published on the card, the distribution targets local inference without a GPU and demonstrates a working flow through the provided examples and Space.
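Since the card publishes no RTF figures, readers may want to measure it themselves. The sketch below shows one common way to compute a real-time factor: wall-clock synthesis time divided by the duration of the audio produced. The `synthesize` callable and `fake_synthesize` stand-in are hypothetical placeholders, not part of the NeuTTS Air API; swap in a real synthesis call that reports how many samples it generated.

```python
import time
from typing import Callable


def real_time_factor(synthesize: Callable[[str], int], text: str,
                     sample_rate_hz: int = 24_000) -> float:
    """Return wall-clock synthesis time divided by audio duration.

    `synthesize` is a hypothetical callable that returns the number of
    audio samples it produced. RTF < 1.0 means faster than real time.
    """
    start = time.perf_counter()
    num_samples = synthesize(text)
    elapsed = time.perf_counter() - start
    audio_seconds = num_samples / sample_rate_hz
    if audio_seconds <= 0:
        raise ValueError("synthesizer produced no audio")
    return elapsed / audio_seconds


def fake_synthesize(text: str) -> int:
    """Stand-in for a real TTS call: pretends to emit 2 s of audio."""
    time.sleep(0.05)              # simulated compute time
    return 2 * 24_000             # 2 seconds at 24 kHz


rtf = real_time_factor(fake_synthesize, "Hello from a local TTS model.")
label = "real-time" if rtf < 1.0 else "slower than real time"
print(f"RTF = {rtf:.3f} ({label})")
```

For a meaningful number, run several utterances of different lengths on the target CPU and report the median, since the first call often includes model-loading overhead.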
Voice cloning workflow
NeuTTS Air requires (1) a reference WAV and (2) the transcript text for that reference. It encodes the reference to style tokens and then synthesizes arbitrary text in the reference speaker’s timbre. The Neuphonic team recommends 3–15 s of clean, mono audio and provides pre-encoded samples.
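Before encoding a reference, it can help to check that the clip meets the recommended shape. The following is a minimal sketch using only Python's standard `wave` module; `check_reference` and `write_tone` are hypothetical helper names, not functions from the NeuTTS Air repository. It verifies channel count and duration only, and it cannot judge whether the recording is clean.

```python
import math
import os
import struct
import tempfile
import wave


def check_reference(path: str, min_s: float = 3.0, max_s: float = 15.0) -> list[str]:
    """Return a list of problems with a reference WAV; empty means it passes."""
    with wave.open(path, "rb") as wf:
        channels = wf.getnchannels()
        duration = wf.getnframes() / wf.getframerate()
    problems = []
    if channels != 1:
        problems.append(f"expected mono, found {channels} channels")
    if not (min_s <= duration <= max_s):
        problems.append(f"duration {duration:.1f} s is outside {min_s}-{max_s} s")
    return problems


def write_tone(path: str, seconds: float, channels: int = 1, rate: int = 24_000) -> None:
    """Write a 16-bit sine tone so the check can be exercised without real audio."""
    frames = bytearray()
    for i in range(int(seconds * rate)):
        sample = int(12_000 * math.sin(2 * math.pi * 220 * i / rate))
        frames += struct.pack("<h", sample) * channels   # same sample on every channel
    with wave.open(path, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(2)        # 2 bytes = 16-bit samples
        wf.setframerate(rate)
        wf.writeframes(bytes(frames))


tmp_dir = tempfile.mkdtemp()
ok_path = os.path.join(tmp_dir, "ref_ok.wav")
bad_path = os.path.join(tmp_dir, "ref_bad.wav")
write_tone(ok_path, 5.0)                 # 5 s mono: within the recommendation
write_tone(bad_path, 1.0, channels=2)    # 1 s stereo: too short and not mono

print(check_reference(ok_path))
print(check_reference(bad_path))
```

A check like this catches the most common mismatches early; the transcript still has to be supplied separately and should match what is spoken in the clip.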
Privacy, responsibility, and watermarking
Neuphonic frames the model for on-device privacy (no audio/text leaves the machine without the user’s approval) and notes that all generated audio is marked with the Perth (Perceptual Threshold) watermarker to support responsible use and provenance.
How does it compare?
Open, local TTS systems exist (e.g., GGUF-based pipelines), but NeuTTS Air is notable for packaging a small LM + neural codec with instant cloning, CPU-first quantizations, and watermarking under a permissive license. The “world’s first super-realistic, on-device speech LM” phrasing is the vendor’s claim; the verifiable facts are the size, formats, cloning procedure, license, and provided runtimes.
Our Comments
The focus is on system trade-offs: a ~0.7B Qwen-class backbone with GGUF quantization paired with NeuCodec at 0.8 kbps/24 kHz is a pragmatic recipe for real-time, CPU-only TTS that preserves timbre using ~3–15 s style references while keeping latency and memory predictable. The Apache-2.0 licensing and built-in watermarking are deployment-friendly, but publishing RTF/latency on commodity CPUs and cloning-quality vs. reference-length curves would enable rigorous benchmarking against existing local pipelines. Operationally, an offline path with minimal dependencies (eSpeak, llama.cpp/ONNX) lowers privacy/compliance risk for edge agents without sacrificing intelligibility.
Check out the Model Card on Hugging Face and the GitHub Page.
The post Neuphonic Open-Sources NeuTTS Air: A 748M-Parameter On-Device Speech Language Model with Instant Voice Cloning appeared first on MarkTechPost.