
Meet 'Kani-TTS-2': A 400M Param Open Source Text-to-Speech Model that Runs in 3GB VRAM with Voice Cloning Support

The landscape of generative audio is shifting toward efficiency. A new open-source contender, Kani-TTS-2, has been released by the team at nineninesix.ai. This model marks a departure from heavy, compute-expensive TTS systems. Instead, it treats audio as a language, delivering high-fidelity speech synthesis with a remarkably small footprint.

Kani-TTS-2 offers a lean, high-performance alternative to closed-source APIs. It is currently available on Hugging Face in both English (EN) and Portuguese (PT) versions.

The Architecture: LFM2 and NanoCodec

Kani-TTS-2 follows the 'Audio-as-Language' philosophy. The model does not use traditional mel-spectrogram pipelines. Instead, it converts raw audio into discrete tokens using a neural codec.

The system relies on a two-stage process:

  1. The Language Backbone: The model is built on LiquidAI's LFM2 (350M) architecture. This backbone generates 'audio intent' by predicting the next audio tokens. Because LFMs (Liquid Foundation Models) are designed for efficiency, they provide a faster alternative to standard transformers.
  2. The Neural Codec: NVIDIA's NanoCodec converts the predicted tokens into 22 kHz waveforms.

By using this architecture, the model captures human-like prosody (the rhythm and intonation of speech) without the 'robotic' artifacts found in older TTS systems.
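To make the two-stage flow concrete, here is a minimal sketch of an 'Audio-as-Language' pipeline in Python. Every name in it (the `backbone` and `codec` objects and their methods) is a hypothetical placeholder for illustration, not the actual Kani-TTS-2 API:

```python
# Minimal sketch of an 'Audio-as-Language' pipeline.
# All object and method names are hypothetical placeholders,
# not the real Kani-TTS-2 interface.

def synthesize(text: str, backbone, codec, max_tokens: int = 1024):
    """Stage 1: LM predicts discrete audio tokens; Stage 2: codec decodes them."""
    # Stage 1: the LFM2-style backbone autoregressively predicts the
    # next audio token, conditioned on the input text.
    audio_tokens = []
    state = backbone.encode_text(text)             # hypothetical text conditioning
    for _ in range(max_tokens):
        token, state = backbone.next_token(state)  # next-token prediction
        if token == backbone.eos_token:            # stop at end-of-speech
            break
        audio_tokens.append(token)

    # Stage 2: a neural codec (NVIDIA NanoCodec in Kani-TTS-2) decodes
    # the discrete tokens back into a 22 kHz waveform.
    return codec.decode(audio_tokens)
```

The key design point is that the backbone never touches mel-spectrograms; it only predicts token IDs, and waveform reconstruction is delegated entirely to the codec.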

Efficiency: 10,000 Hours in 6 Hours

The training metrics for Kani-TTS-2 are a masterclass in optimization. The English model was trained on 10,000 hours of high-quality speech data.

While that scale is impressive, the speed of training is the real story. The research team trained the model in only 6 hours on a cluster of 8 NVIDIA H100 GPUs, showing that large speech datasets no longer demand weeks of compute time when paired with an efficient architecture like LFM2.
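Assuming the reported 6-hour figure covers the full run, a quick back-of-the-envelope calculation puts that throughput in perspective (the number of training epochs is not reported, so treat this as audio-hours processed per GPU-hour at face value):

```python
# Back-of-the-envelope throughput from the reported training figures.
audio_hours = 10_000        # hours of speech in the training set
gpus = 8                    # NVIDIA H100 GPUs
wall_clock_hours = 6        # reported training time

gpu_hours = gpus * wall_clock_hours     # 48 GPU-hours total
throughput = audio_hours / gpu_hours    # ~208 audio-hours per GPU-hour

print(f"{gpu_hours} GPU-hours total, ~{throughput:.0f} hours of audio per GPU-hour")
```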

Zero-Shot Voice Cloning and Performance

The standout feature for developers is zero-shot voice cloning. Unlike traditional models that require fine-tuning for new voices, Kani-TTS-2 uses speaker embeddings.

  • How it works: You provide a short reference audio clip.
  • The result: The model extracts the unique characteristics of that voice and applies them to the generated text instantly, as sketched below.
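Here is a hedged sketch of that flow. The loader function, repository ID, and method names are assumptions made for illustration only; consult the Hugging Face model card for the actual interface:

```python
# Hypothetical zero-shot cloning flow; the loader, repo ID, and method
# names below are illustrative assumptions, not the documented API.
import soundfile as sf  # pip install soundfile

model = load_tts_model("nineninesix/kani-tts-2-en")        # hypothetical loader

# 1. Extract a speaker embedding from a short reference clip.
speaker_embedding = model.embed_speaker("reference_voice.wav")

# 2. Condition generation on that embedding; no fine-tuning involved.
waveform = model.generate(
    text="Zero-shot cloning needs only a short reference clip.",
    speaker=speaker_embedding,
)

sf.write("cloned_output.wav", waveform, samplerate=22050)  # 22 kHz output
```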

From a deployment perspective, the model is highly accessible:

  • Parameter Count: 400M (0.4B) parameters.
  • Speed: It features a Real-Time Factor (RTF) of 0.2, meaning it can generate 10 seconds of speech in roughly 2 seconds (see the quick calculation after this list).
  • Hardware: It requires only 3GB of VRAM, making it compatible with consumer-grade GPUs like the RTX 3060 or 4050.
  • License: Released under the Apache 2.0 license, allowing for commercial use.
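Since RTF is defined as compute time divided by audio duration, the latency math is straightforward:

```python
# Real-Time Factor: generation_time = RTF * audio_duration.
RTF = 0.2  # reported for Kani-TTS-2

def generation_seconds(audio_seconds: float, rtf: float = RTF) -> float:
    """Seconds of compute needed to synthesize a clip of the given length."""
    return rtf * audio_seconds

print(generation_seconds(10.0))  # -> 2.0, i.e. 10 s of speech in ~2 s
```

Any RTF below 1.0 means the model generates audio faster than real time, which is what makes low-latency and streaming use cases viable on modest hardware.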

Key Takeaways

  • Efficient Architecture: The model totals roughly 400M parameters, built on LiquidAI's LFM2 (350M) backbone. This 'Audio-as-Language' approach treats speech as discrete tokens, allowing for faster processing and more human-like intonation than traditional pipelines.
  • Rapid Training at Scale: Kani-TTS-2-EN was trained on 10,000 hours of high-quality speech data in just 6 hours using 8 NVIDIA H100 GPUs.
  • Instant Zero-Shot Cloning: There is no need for fine-tuning to replicate a specific voice. By providing a short reference audio clip, the model uses speaker embeddings to instantly synthesize text in the target speaker's voice.
  • High Performance on Edge Hardware: With a Real-Time Factor (RTF) of 0.2, the model can generate 10 seconds of audio in approximately 2 seconds. It requires only 3GB of VRAM, making it fully functional on consumer-grade GPUs like the RTX 3060.
  • Developer-Friendly Licensing: Released under the Apache 2.0 license, Kani-TTS-2 is ready for commercial integration. It offers a local-first, low-latency alternative to expensive closed-source TTS APIs.
