Fish Audio Releases Fish Audio S2: A New Generation of Expressive Text-to-Speech (TTS) with Absurdly Controllable Emotion
The landscape of Text-to-Speech (TTS) is moving away from modular pipelines toward integrated Large Audio Models (LAMs). Fish Audio’s release of S2-Pro, the flagship model within the Fish Speech ecosystem, represents a shift toward open architectures capable of high-fidelity, multi-speaker synthesis with sub-150ms latency. The release provides a framework for zero-shot voice cloning and granular emotional control using a Dual-Auto-Regressive (AR) approach.
Architecture: The Dual-AR Framework and RVQ
The fundamental technical distinction in Fish Audio S2-Pro is its hierarchical Dual-AR architecture. Traditional TTS models often struggle with the trade-off between sequence length and acoustic detail. S2-Pro addresses this by bifurcating the generation process into two specialized stages: a ‘Slow AR’ model and a ‘Fast AR’ model.
- The Slow AR Model (4B Parameters): This component operates on the time-axis. It is responsible for processing linguistic input and generating semantic tokens. By utilizing a larger parameter count (approximately 4 billion), the Slow AR model captures long-range dependencies, prosody, and the structural nuances of speech.
- The Fast AR Model (400M Parameters): This component processes the acoustic dimension. It predicts the residual codebooks for each semantic token. This smaller, faster model ensures that the high-frequency details of the audio—timbre, breathiness, and texture—are generated with high efficiency.
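The division of labor between the two stages can be illustrated with a minimal sketch. The real models are Transformers; the deterministic stand-ins below (and the token/codebook sizes) are purely illustrative of the data flow: the slow stage emits one semantic token per time step, and the fast stage expands each into a stack of residual codebook indices.

```python
# Toy sketch of the Dual-AR generation loop. The arithmetic stand-ins
# replace the real Transformers; only the shape of the data flow matters.

def slow_ar(text_tokens):
    """'Slow' stage: one semantic token per input token (time axis)."""
    return [(i * 37 + len(tok)) % 1024 for i, tok in enumerate(text_tokens)]

def fast_ar(semantic_token, num_codebooks=8):
    """'Fast' stage: residual codebook indices for one semantic token."""
    return [(semantic_token * 13 + k) % 1024 for k in range(num_codebooks)]

def generate(text):
    text_tokens = text.split()
    semantic = slow_ar(text_tokens)            # long-range structure
    acoustic = [fast_ar(s) for s in semantic]  # per-frame acoustic detail
    return semantic, acoustic

semantic, acoustic = generate("hello world from fish speech")
print(len(semantic), len(acoustic), len(acoustic[0]))  # 5 5 8
```

Because the fast model only ever conditions on one semantic token at a time, it can stay small (400M vs. 4B) without sacrificing acoustic detail.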
This system relies on Residual Vector Quantization (RVQ). In this setup, raw audio is compressed into discrete tokens across multiple layers (codebooks). The first layer captures the primary acoustic features, while subsequent layers capture the ‘residuals’ or the remaining errors from the previous layer. This allows the model to reconstruct high-fidelity 44.1kHz audio while maintaining a manageable token count for the Transformer architecture.
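The residual idea is easiest to see in one dimension. Below is a minimal RVQ sketch with hand-picked scalar codebooks (real RVQ uses learned codebooks over high-dimensional vectors): each layer quantizes whatever error the previous layers left behind, so reconstruction error shrinks layer by layer.

```python
# Minimal residual vector quantization on a 1-D value (illustrative only).
CODEBOOKS = [
    [-1.0, 0.0, 1.0],   # layer 1: coarse values
    [-0.3, 0.0, 0.3],   # layer 2: finer residuals
    [-0.1, 0.0, 0.1],   # layer 3: finest residuals
]

def rvq_encode(x):
    """Return one codebook index per layer; each layer quantizes the residual."""
    indices, residual = [], x
    for cb in CODEBOOKS:
        idx = min(range(len(cb)), key=lambda i: abs(cb[i] - residual))
        indices.append(idx)
        residual -= cb[idx]
    return indices

def rvq_decode(indices):
    """Reconstruct by summing the selected entry from every codebook."""
    return sum(cb[i] for cb, i in zip(CODEBOOKS, indices))

x = 0.72
codes = rvq_encode(x)          # [2, 0, 1] -> 1.0 + (-0.3) + 0.0
x_hat = rvq_decode(codes)
print(codes, x_hat)            # reconstruction error is only 0.02
```

This is why the first codebook matters most and later codebooks add progressively finer detail, and why the Fast AR model can predict them cheaply.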
Emotional Control via In-Context Learning and Inline Tags
Fish Audio S2-Pro achieves what the developers describe as ‘absurdly controllable emotion’ through two primary mechanisms: zero-shot in-context learning and natural language inline control.
In-Context Learning (ICL):
Unlike older generations of TTS that required explicit fine-tuning to mimic a specific voice, S2-Pro utilizes the Transformer’s ability to perform in-context learning. By providing a reference audio clip—ideally between 10 and 30 seconds—the model extracts the speaker’s identity and emotional state. The model treats this reference as a prefix in its context window, allowing it to continue the “sequence” in the same voice and style.
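A rough sketch of how such a prefix might be assembled is shown below. The special markers and sequence layout here are assumptions for illustration (the actual tokenization and ordering are model-internal); the point is that cloning is just sequence continuation.

```python
# Sketch: a reference clip becomes an in-context prefix. Marker tokens
# like <text>/<audio> are hypothetical stand-ins for model-internal tokens.

def build_icl_context(ref_text, ref_audio_tokens, target_text):
    """Reference transcript + its audio tokens, then the new text.
    The model continues the sequence with audio tokens for target_text,
    in the same voice and style as the reference."""
    return (
        ["<text>"] + ref_text.split() + ["</text>"]
        + ["<audio>"] + ref_audio_tokens + ["</audio>"]
        + ["<text>"] + target_text.split() + ["</text>"]
        + ["<audio>"]  # generation resumes from here
    )

ctx = build_icl_context("hello there", [101, 102, 103], "nice to meet you")
print(ctx[-1])  # "<audio>" -- the model picks up the sequence at this point
```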
Inline Control Tags:
The model supports dynamic emotional transitions within a single generation pass. Because the model was trained on data containing descriptive linguistic markers, developers can insert natural language tags directly into the text prompt. For example:
[whisper] I have a secret [laugh] that I cannot tell you.
The model interprets these tags as instructions to modify the acoustic tokens in real-time, adjusting pitch, intensity, and rhythm without requiring a separate emotional embedding or external control vector.
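From a developer's perspective, such tagged prompts are easy to construct and inspect. The sketch below splits a prompt into (tag, text) segments, purely to show the format; inside the model the tags are consumed in-stream rather than pre-parsed, and the full tag vocabulary is model-specific.

```python
import re

# Split a prompt with inline control tags (e.g. [whisper], [laugh]) into
# (tag, text) segments. Illustrative: the model itself reads tags in-stream.
TAG_RE = re.compile(r"\[(\w+)\]")

def split_inline_tags(prompt):
    segments, tag, pos = [], None, 0
    for m in TAG_RE.finditer(prompt):
        text = prompt[pos:m.start()].strip()
        if text:
            segments.append((tag, text))
        tag, pos = m.group(1), m.end()
    tail = prompt[pos:].strip()
    if tail:
        segments.append((tag, tail))
    return segments

segments = split_inline_tags("[whisper] I have a secret [laugh] that I cannot tell you.")
print(segments)
# [('whisper', 'I have a secret'), ('laugh', 'that I cannot tell you.')]
```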
Performance Benchmarks and SGLang Integration
When integrating TTS into real-time applications, the primary constraint is ‘Time to First Audio’ (TTFA). Fish Audio S2-Pro is optimized for sub-150ms latency, with benchmarks on NVIDIA H200 hardware reaching approximately 100ms.
Several technical optimizations contribute to this performance:
- SGLang and RadixAttention: S2-Pro is designed to work with SGLang, a high-performance serving framework. It utilizes RadixAttention, which allows for efficient Key-Value (KV) cache management. In a production environment where the same “master” voice prompt (reference clip) is used repeatedly, RadixAttention caches the prefix’s KV states. This eliminates the need to re-compute the reference audio for every request, significantly reducing the prefill time.
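The payoff of prefix caching can be shown with a toy cache keyed on the voice-prompt tokens. This is a simplified analogue of what RadixAttention does (SGLang manages this automatically via a radix tree over token prefixes, and the "KV state" below is a stand-in):

```python
import hashlib

# Toy prefix cache: the expensive prefill for a repeated voice prompt
# runs once, after which its (stand-in) KV state is reused.
class PrefixCache:
    def __init__(self):
        self._cache = {}
        self.misses = 0

    def kv_for_prompt(self, voice_prompt_tokens):
        key = hashlib.sha256(bytes(voice_prompt_tokens)).hexdigest()
        if key not in self._cache:
            self.misses += 1  # expensive prefill happens only here
            self._cache[key] = self._prefill(voice_prompt_tokens)
        return self._cache[key]

    def _prefill(self, tokens):
        return [t * 2 for t in tokens]  # stand-in for real KV states

cache = PrefixCache()
master_voice = [7, 7, 7, 42]
for _ in range(1000):               # 1000 requests, same master voice
    cache.kv_for_prompt(master_voice)
print(cache.misses)                 # 1 -- prefilled once, reused 999 times
```

With a long reference clip, that single avoided prefill per request is exactly where the TTFA savings come from.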
- Multi-Speaker Single-Pass Generation: The architecture allows for multiple speaker identities to be present within the same context window. This permits the generation of complex dialogues or multi-character narrations in a single inference call, avoiding the latency overhead of switching models or reloading weights for different speakers.
Technical Implementation and Data Scaling
The Fish Speech repository provides a Python-based implementation utilizing PyTorch. The model was trained on a diverse dataset comprising over 300,000 hours of multi-lingual audio. This scale is what enables the model’s robust performance across different languages and its ability to handle ‘non-verbal’ vocalizations like sighs or hesitations.
The training pipeline involves:
- VQ-GAN Training: Training the quantizer to map audio into a discrete latent space.
- LLM Training: Training the Dual-AR transformers to predict those latent tokens based on text and acoustic prefixes.
The VQ-GAN used in S2-Pro is specifically tuned to minimize artifacts during the decoding process, ensuring that even at high compression ratios, the reconstructed audio remains ‘transparent’ (indistinguishable from the source to the human ear).
Key Takeaways
- Dual-AR Architecture (Slow/Fast): Unlike single-stage models, S2-Pro splits tasks between a 4B parameter ‘Slow AR’ model (for linguistic and prosodic structure) and a 400M parameter ‘Fast AR’ model (for acoustic refinement), optimizing both detail and speed.
- Sub-150ms Latency: Engineered for real-time conversational AI, the model achieves a Time-to-First-Audio (TTFA) of ~100ms on high-end hardware, making it suitable for live agents and interactive applications.
- Hierarchical RVQ Encoding: By using Residual Vector Quantization, the system compresses 44.1kHz audio into discrete tokens across multiple layers. This allows the model to reconstruct complex vocal textures—including breaths and sighs—without the computational bloat of raw waveforms.
- Zero-Shot In-Context Learning: Developers can clone a voice and its emotional state by providing a 10–30 second reference clip. The model treats this as a prefix, adopting the speaker’s timbre and prosody without requiring additional fine-tuning.
- RadixAttention & SGLang Integration: Optimized for production, S2-Pro leverages RadixAttention to cache KV states of voice prompts. This allows for nearly instant generation when using the same speaker repeatedly, drastically reducing prefill overhead.
Check out the Model Card and Repo.
The post Fish Audio Releases Fish Audio S2: A New Generation of Expressive Text-to-Speech (TTS) with Absurdly Controllable Emotion appeared first on MarkTechPost.
