
Meet 'Kani-TTS-2': A 400M Param Open Source Text-to-Speech Model that Runs in 3GB VRAM with Voice Cloning Support

The landscape of generative audio is shifting toward efficiency. A new open-source contender, Kani-TTS-2, has been released by the team at nineninesix.ai. This model marks a departure from heavy, compute-expensive TTS systems. Instead, it treats audio as a language, delivering high-fidelity speech synthesis with a remarkably small footprint.

Kani-TTS-2 offers a lean, high-performance alternative to closed-source APIs. It is currently available on Hugging Face in both English (EN) and Portuguese (PT) versions.

The Architecture: LFM2 and NanoCodec

Kani-TTS-2 follows the 'Audio-as-Language' philosophy. The model does not use traditional mel-spectrogram pipelines. Instead, it converts raw audio into discrete tokens using a neural codec.

The system relies on a two-stage process:

  1. The Language Backbone: The model is built on LiquidAI's LFM2 (350M) architecture. This backbone generates 'audio intent' by predicting the next audio tokens. Because LFMs (Liquid Foundation Models) are designed for efficiency, they provide a faster alternative to standard transformers.
  2. The Neural Codec: NVIDIA's NanoCodec converts the predicted tokens into 22 kHz waveforms.

By using this architecture, the model captures human-like prosody (the rhythm and intonation of speech) without the 'robotic' artifacts found in older TTS systems.
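To make the two-stage flow concrete, here is a minimal sketch of an 'Audio-as-Language' pipeline in Python. Every name in it (the `backbone` and `codec` objects and their methods) is a hypothetical placeholder for illustration, not the actual Kani-TTS-2 API:

```python
# Minimal sketch of an 'Audio-as-Language' pipeline.
# All object and method names are hypothetical placeholders,
# not the real Kani-TTS-2 interface.

def synthesize(text: str, backbone, codec, max_tokens: int = 1024):
    """Stage 1: LM predicts discrete audio tokens; Stage 2: codec decodes them."""
    # Stage 1: the LFM2-style backbone autoregressively predicts the
    # next audio token, conditioned on the input text.
    audio_tokens = []
    state = backbone.encode_text(text)             # hypothetical text conditioning
    for _ in range(max_tokens):
        token, state = backbone.next_token(state)  # next-token prediction
        if token == backbone.eos_token:            # stop at end-of-speech
            break
        audio_tokens.append(token)

    # Stage 2: a neural codec (NVIDIA NanoCodec in Kani-TTS-2) decodes
    # the discrete tokens back into a 22 kHz waveform.
    return codec.decode(audio_tokens)
```

The key design point is that the backbone never touches mel-spectrograms; it only predicts token IDs, and waveform reconstruction is delegated entirely to the codec.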

Efficiency: 10,000 Hours in 6 Hours

The training metrics for Kani-TTS-2 are a masterclass in optimization. The English model was trained on 10,000 hours of high-quality speech data.

While that scale is impressive, the speed of training is the real story. The research team trained the model in only 6 hours on a cluster of 8 NVIDIA H100 GPUs, showing that large speech datasets no longer demand weeks of compute time when paired with an efficient architecture like LFM2.
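Assuming the reported 6-hour figure covers the full run, a quick back-of-the-envelope calculation puts that throughput in perspective (the number of training epochs is not reported, so treat this as audio-hours processed per GPU-hour at face value):

```python
# Back-of-the-envelope throughput from the reported training figures.
audio_hours = 10_000        # hours of speech in the training set
gpus = 8                    # NVIDIA H100 GPUs
wall_clock_hours = 6        # reported training time

gpu_hours = gpus * wall_clock_hours     # 48 GPU-hours total
throughput = audio_hours / gpu_hours    # ~208 audio-hours per GPU-hour

print(f"{gpu_hours} GPU-hours total, ~{throughput:.0f} hours of audio per GPU-hour")
```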

Zero-Shot Voice Cloning and Performance

The standout feature for developers is zero-shot voice cloning. Unlike traditional models that require fine-tuning for new voices, Kani-TTS-2 uses speaker embeddings.

  • How it works: You provide a short reference audio clip.
  • The result: The model extracts the unique characteristics of that voice and applies them to the generated text instantly, as sketched below.
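Here is a hedged sketch of that flow. The loader function, repository ID, and method names are assumptions made for illustration only; consult the Hugging Face model card for the actual interface:

```python
# Hypothetical zero-shot cloning flow; the loader, repo ID, and method
# names below are illustrative assumptions, not the documented API.
import soundfile as sf  # pip install soundfile

model = load_tts_model("nineninesix/kani-tts-2-en")        # hypothetical loader

# 1. Extract a speaker embedding from a short reference clip.
speaker_embedding = model.embed_speaker("reference_voice.wav")

# 2. Condition generation on that embedding; no fine-tuning involved.
waveform = model.generate(
    text="Zero-shot cloning needs only a short reference clip.",
    speaker=speaker_embedding,
)

sf.write("cloned_output.wav", waveform, samplerate=22050)  # 22 kHz output
```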

From a deployment perspective, the model is highly accessible:

  • Parameter Count: 400M (0.4B) parameters.
  • Speed: It features a Real-Time Factor (RTF) of 0.2, meaning it can generate 10 seconds of speech in roughly 2 seconds (see the quick calculation after this list).
  • Hardware: It requires only 3GB of VRAM, making it compatible with consumer-grade GPUs like the RTX 3060 or 4050.
  • License: Released under the Apache 2.0 license, allowing for commercial use.
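Since RTF is defined as compute time divided by audio duration, the latency math is straightforward:

```python
# Real-Time Factor: generation_time = RTF * audio_duration.
RTF = 0.2  # reported for Kani-TTS-2

def generation_seconds(audio_seconds: float, rtf: float = RTF) -> float:
    """Seconds of compute needed to synthesize a clip of the given length."""
    return rtf * audio_seconds

print(generation_seconds(10.0))  # -> 2.0, i.e. 10 s of speech in ~2 s
```

Any RTF below 1.0 means the model generates audio faster than real time, which is what makes low-latency and streaming use cases viable on modest hardware.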

Key Takeaways

  • Efficient Architecture: The model totals roughly 400M parameters, built on LiquidAI's LFM2 (350M) backbone. This 'Audio-as-Language' approach treats speech as discrete tokens, allowing for faster processing and more human-like intonation than traditional pipelines.
  • Rapid Training at Scale: Kani-TTS-2-EN was trained on 10,000 hours of high-quality speech data in just 6 hours using 8 NVIDIA H100 GPUs.
  • Instant Zero-Shot Cloning: There is no need for fine-tuning to replicate a specific voice. By providing a short reference audio clip, the model uses speaker embeddings to instantly synthesize text in the target speaker's voice.
  • High Performance on Edge Hardware: With a Real-Time Factor (RTF) of 0.2, the model can generate 10 seconds of audio in approximately 2 seconds. It requires only 3GB of VRAM, making it fully functional on consumer-grade GPUs like the RTX 3060.
  • Developer-Friendly Licensing: Released under the Apache 2.0 license, Kani-TTS-2 is ready for commercial integration. It offers a local-first, low-latency alternative to expensive closed-source TTS APIs.
