StepFun Releases StepAudio 2.5 Realtime: An End-to-End Voice Model with Roleplay-Specific RLHF and Paralinguistic Comprehension
StepFun, the Shanghai-based AI lab, has launched StepAudio 2.5 Realtime, an end-to-end real-time speech large language model with fully customizable persona capabilities.
StepAudio 2.5 Realtime is a voice model that operates in real time. Unlike pipeline-based systems that separate speech recognition, reasoning, and synthesis into sequential steps, it is an end-to-end model: audio goes in and audio comes out through a single unified system. The model supports Chinese and English.
It connects via a WebSocket API. The endpoint is wss://api.stepfun.com/v1/realtime, using the model string step-2.5-realtime.
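As a rough illustration of how a client might prepare that connection, here is a minimal Python sketch. The endpoint and model string come from the announcement. Passing the model as a `model` query parameter and authenticating with a Bearer token are assumptions made for illustration, and the `build_connection` helper is hypothetical. Check StepFun's API documentation for the actual handshake.

```python
from urllib.parse import urlencode

REALTIME_ENDPOINT = "wss://api.stepfun.com/v1/realtime"  # from the announcement
MODEL_NAME = "step-2.5-realtime"                         # from the announcement


def build_connection(api_key: str, model: str = MODEL_NAME) -> tuple[str, dict[str, str]]:
    """Return (url, headers) that a WebSocket client could be opened with.

    ASSUMPTION: the model is selected with a `model` query parameter and the
    API key is sent as a Bearer token. Neither detail is confirmed by the article.
    """
    if not api_key:
        raise ValueError("api_key must be non-empty")
    url = f"{REALTIME_ENDPOINT}?{urlencode({'model': model})}"
    headers = {"Authorization": f"Bearer {api_key}"}
    return url, headers


url, headers = build_connection("YOUR_API_KEY")
print(url)
```

The returned URL and headers can then be handed to whichever WebSocket client library the application already uses.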
The Three Technical Pillars
StepFun's research team describes three core architectural innovations behind the model:
1. Million-Scale Persona Data Augmentation
Starting from 10,000+ high-quality, natively authored personas, StepFun applied algorithmic augmentation to construct a million-scale persona feature matrix. This was combined with millions of real-world conversational samples for training. The intent is generalization: specifically, stable performance on difficult, long-tail conversational topics.
Instead of manually labeling millions of persona samples, the StepFun team used algorithmic expansion from a curated seed set.
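StepFun has not published the augmentation algorithm, so the following is only a toy sketch of the general idea of expanding a small seed set combinatorially. The trait names, values, and the `expand_personas` helper are all hypothetical and are not taken from the model's training data.

```python
import itertools
import random

# Hypothetical seed attributes, invented for illustration only.
SEED_TRAITS = {
    "role": ["tutor", "detective", "travel guide"],
    "temperament": ["warm", "sarcastic", "reserved"],
    "speaking_style": ["formal", "casual", "poetic"],
    "pace": ["slow", "brisk"],
}


def expand_personas(traits, sample_size=None, seed=0):
    """Build persona dicts from every combination of trait values.

    If sample_size is given and smaller than the full product, return a
    reproducible random subset instead of all combinations.
    """
    keys = list(traits)
    combos = [
        dict(zip(keys, values))
        for values in itertools.product(*(traits[k] for k in keys))
    ]
    if sample_size is not None and sample_size < len(combos):
        combos = random.Random(seed).sample(combos, sample_size)
    return combos


personas = expand_personas(SEED_TRAITS)
print(len(personas))  # 3 * 3 * 3 * 2 = 54 combinations
```

Because the number of combinations is the product of the per-trait option counts, even a modest seed set grows quickly, which is the property that makes a million-scale matrix reachable from roughly ten thousand authored personas.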
2. Roleplay-Specific RLHF Alignment
A known failure mode in conversational AI is "out-of-character" (OOC) behavior, in which a model drifts away from its defined persona mid-conversation. The StepFun team conducted dedicated RLHF (Reinforcement Learning from Human Feedback) optimization specifically for persona consistency in roleplay scenarios. RLHF is a training technique in which human preference signals are used to train a reward model, which then guides language model behavior. Applying it specifically to roleplay stability is a targeted design choice.
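To make the reward-model step concrete, here is a minimal sketch of the pairwise (Bradley-Terry style) preference loss commonly used in RLHF. It illustrates the general technique, not StepFun's training code. The idea that the "chosen" reply is the in-character one is an assumption about how such data might be labeled.

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise preference loss: -log(sigmoid(r_chosen - r_rejected)).

    The loss is small when the reward model scores the human-preferred
    reply (assumed here to be the in-character one) above the rejected reply.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals softplus(-m); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Reward model agrees with the rater: low loss.
print(preference_loss(2.0, 0.0))
# Reward model disagrees with the rater: high loss.
print(preference_loss(0.0, 2.0))
```

Minimizing this loss over many rated pairs pushes the reward model to score preferred replies higher, and that reward signal is then used to optimize the language model.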
3. Unified Speech Understanding and Generation
StepAudio 2.5 Realtime inherits the StepAudio 2.5 TTS capabilities and deeply fuses speech understanding and generation through reinforcement learning. This enables what StepFun calls "global scene-level tonal setting" and "intra-sentence detail sculpting." The model can set an overall emotional register for a response while adjusting finer acoustic details within individual sentences.
Paralinguistic Understanding
A technically distinct area of this model is paralinguistic perception. Paralinguistics refers to non-verbal acoustic information in speech, such as tone, speaking rate, pauses, sighs, and laughter. By analyzing these elements, the model can perceive the user's mood and underlying intentions. For example, it can identify fatigue from a low tone or frustration from a rapid speech rate. Capturing these signals requires the model to operate on audio features rather than transcribed text alone.
StepAudio 2.5 Realtime scored 82.18 on the paralinguistic comprehension benchmark, demonstrating perception of vocal speed, emotion, age, and other acoustic features.

Benchmark Results
StepFun's research team conducted a comprehensive suite of subjective and objective evaluations, benchmarking StepAudio 2.5 Realtime against leading real-time voice models across five dimensions.
Human evaluation is conducted through real mobile app conversations scored by human raters. The scores:
- Human evaluation (subjective): 80.41
- General dialogue (objective): 86.36
- Automotive scenario (objective): 84.80
- Spoken QA, covering 11 audio understanding tasks (objective): 79.80
- Paralinguistic comprehension (objective): 82.18
Key Takeaways
- StepAudio 2.5 Realtime is an end-to-end real-time speech LLM, launched by Shanghai-based StepFun.
- It uses persona-specific RLHF and million-scale data augmentation to maintain stable character consistency.
- The model ranked first across all five benchmark dimensions, tested in April 2026.
- Paralinguistic comprehension (perceiving tone, rate, and emotion from audio) is a core technical differentiator.
- API access is via WebSocket at wss://api.stepfun.com/v1/realtime with model string step-2.5-realtime.
The post StepFun Releases StepAudio 2.5 Realtime: An End-to-End Voice Model with Roleplay-Specific RLHF and Paralinguistic Comprehension appeared first on MarkTechPost.
