StepFun AI Releases Step-Audio 2 Mini: An Open-Source 8B Speech-to-Speech AI Model that Surpasses GPT-4o-Audio

The StepFun AI team has released Step-Audio 2 Mini, an 8B-parameter speech-to-speech large audio language model (LALM) that delivers expressive, grounded, and real-time audio interaction. Released under the Apache 2.0 license, this open-source model achieves state-of-the-art performance across speech recognition, audio understanding, and speech conversation benchmarks, surpassing commercial systems such as GPT-4o-Audio.

Key Features
1. Unified Audio–Text Tokenization
Unlike cascaded ASR + LLM + TTS pipelines, Step-Audio 2 integrates multimodal discrete token modeling, in which text and audio tokens share a single modeling stream (a conceptual sketch follows the list below).
This enables:
- Seamless reasoning across text and audio.
- On-the-fly voice style switching during inference.
- Consistency in semantic, prosodic, and emotional outputs.
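To make the shared-stream idea concrete, here is a minimal, illustrative Python sketch; the Token layout, vocabulary split, and routing are assumptions for illustration, not the published Step-Audio 2 interface.

```python
# Illustrative only: Token, the vocabulary split, and the routing below are
# assumptions to show one decoding stream carrying both modalities; they are
# not the published Step-Audio 2 interface.
from dataclasses import dataclass

@dataclass
class Token:
    modality: str  # "text" or "audio"
    value: int     # id in the shared vocabulary

def route(stream: list[Token]) -> tuple[list[int], list[int]]:
    """Split one interleaved stream into text ids and audio-codec ids."""
    text_ids = [t.value for t in stream if t.modality == "text"]
    audio_ids = [t.value for t in stream if t.modality == "audio"]
    return text_ids, audio_ids

# A single response can interleave modalities, so voice style can switch
# mid-utterance without leaving the decoding loop.
stream = [Token("text", 101), Token("audio", 9001),
          Token("audio", 9002), Token("text", 102)]
print(route(stream))  # ([101, 102], [9001, 9002])
```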
2. Expressive and Emotion-Aware Generation
The model does not just transcribe speech; it interprets paralinguistic features such as pitch, rhythm, emotion, timbre, and style. This allows conversations with realistic emotional tones such as whispering, sadness, or joy. On the StepEval-Audio-Paralinguistic benchmark, Step-Audio 2 reaches 83.1% accuracy, far beyond GPT-4o Audio (43.5%) and Qwen-Omni (44.2%).
3. Retrieval-Augmented Speech Generation
Step-Audio 2 incorporates multimodal retrieval-augmented generation (RAG):
- Web search integration for factual grounding.
- Audio search, a novel capability that retrieves real voices from a large library and fuses them into responses, enabling voice timbre/style imitation at inference time (sketched below).
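A minimal sketch of the audio-search idea, under stated assumptions: embed a style/timbre query, find the nearest voice in a library, and condition generation on it. The `embed` function and the library format here are hypothetical stand-ins, not StepFun's retrieval stack.

```python
# Hypothetical audio retrieval: a real system would use a trained speaker/
# style encoder; embed() below is only a stand-in for illustration.
import numpy as np

def embed(audio: np.ndarray) -> np.ndarray:
    """Stand-in embedding: normalize a slice of the waveform."""
    v = audio[:16]
    return v / (np.linalg.norm(v) + 1e-8)

def retrieve_voice(query: np.ndarray, library: dict[str, np.ndarray]) -> str:
    """Return the library voice whose embedding is most similar to the query."""
    q = embed(query)
    return max(library, key=lambda name: float(embed(library[name]) @ q))

library = {"narrator_a": np.random.rand(16000), "narrator_b": np.random.rand(16000)}
query = np.random.rand(16000)
print("conditioning response on:", retrieve_voice(query, library))
```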
4. Tool Calling and Multimodal Reasoning
The system extends beyond speech synthesis by supporting tool invocation. Benchmarks show that Step-Audio 2 matches textual LLMs in tool selection and parameter accuracy, while uniquely excelling at audio-search tool calls, a capability unavailable to text-only LLMs; the dispatch pattern is sketched below.
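The following sketch illustrates the general tool-calling pattern described above: the model emits a structured call, a runtime dispatches it, and the result is fed back for the final spoken answer. The tool names and JSON schema are assumptions, not StepFun's published spec.

```python
# Generic tool-call dispatch loop; schema and tool names are illustrative
# assumptions, not StepFun's actual interface.
import json

def web_search(query: str) -> str:
    return f"(stub) web results for {query!r}"

def audio_search(description: str) -> str:
    return f"(stub) voice clip matching {description!r}"

TOOLS = {"web_search": web_search, "audio_search": audio_search}

# Suppose the model's response contains this structured call:
model_output = '{"tool": "audio_search", "arguments": {"description": "calm narrator"}}'

call = json.loads(model_output)
result = TOOLS[call["tool"]](**call["arguments"])
print(result)  # returned to the model to ground the final spoken response
```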
Training and Data Scale
- Text + Audio Corpus: 1.356T tokens
- Audio Hours: 8M+ real and synthetic hours
- Speaker Diversity: ~50K voices across languages and dialects
- Pretraining Pipeline: a multi-stage curriculum covering ASR, TTS, speech-to-speech translation, and emotion-labeled conversational synthesis (a schematic sketch follows the list).
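To make the staged-curriculum idea concrete, here is a hypothetical config sketch; the stage names, ordering, and fields are assumptions for illustration, not StepFun's published training recipe.

```python
# Hypothetical curriculum config; stages and fields are assumptions made
# only to illustrate multi-stage pretraining, not StepFun's actual recipe.
CURRICULUM = [
    {"stage": "asr",      "task": "speech -> text"},
    {"stage": "tts",      "task": "text -> speech"},
    {"stage": "s2st",     "task": "speech -> speech translation"},
    {"stage": "dialogue", "task": "emotion-labeled conversational synthesis"},
]

for step in CURRICULUM:
    print(f"stage {step['stage']}: train on {step['task']}")
```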
This large-scale training allows Step-Audio 2 Mini to retain strong text reasoning (via its Qwen2-Audio and CosyVoice foundation) while mastering fine-grained audio modeling.
Performance Benchmarks


Automatic Speech Recognition (ASR)
- English: average WER 3.14% (beats GPT-4o Transcribe at an average of 4.5%).
- Chinese: average CER 3.08% (significantly lower than GPT-4o and Qwen-Omni).
- Robust across dialects and accents (WER itself is computed as in the example below).
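WER, the metric cited above, is the word-level edit distance divided by the reference length; here is a quick check with the jiwer package, using toy sentences rather than StepFun's evaluation data.

```python
# Word error rate = (substitutions + insertions + deletions) / reference words.
# Toy sentences for illustration; the benchmark figures above come from
# StepFun's evaluation, not this pair.
import jiwer  # pip install jiwer

reference = "open source speech models are improving quickly"
hypothesis = "open sourced speech models improving quickly"

print(f"WER: {jiwer.wer(reference, hypothesis):.2%}")
```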
Audio Understanding (MMAU Benchmark)
- Step-Audio 2: 78.0 average, outperforming Omni-R1 (77.0) and Audio Flamingo 3 (73.1).
- Strongest on sound and speech reasoning tasks.
Speech Translation
- CoVoST 2 (S2TT): BLEU 39.26, the highest among open and closed models (see the BLEU example below).
- CVSS (S2ST): BLEU 30.87, ahead of GPT-4o (23.68).
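BLEU, cited for both tracks above, measures n-gram overlap between a system translation and one or more references; a minimal scoring call with the sacrebleu package, on toy strings rather than the benchmark data:

```python
# Corpus-level BLEU with sacrebleu; toy strings for illustration only.
import sacrebleu  # pip install sacrebleu

hypotheses = ["the cat sat on the mat"]
references = [["the cat is sitting on the mat"]]  # one reference stream

print(f"BLEU: {sacrebleu.corpus_bleu(hypotheses, references).score:.2f}")
```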
Conversational Benchmarks (URO-Bench)
- Chinese conversations: best overall at 83.3 (basic) and 68.2 (pro).
- English conversations: competitive with GPT-4o (83.9 vs. 84.5), far ahead of other open models.
Conclusion
Step-Audio 2 Mini makes advanced, multimodal speech intelligence accessible to developers and the research community. By combining Qwen2-Audio's reasoning capacity with CosyVoice's tokenization pipeline, and augmenting both with retrieval-based grounding, StepFun has delivered one of the most capable open audio LLMs.
Check out the Paper and the Model on Hugging Face.