
StepFun Releases StepAudio 2.5 Realtime: An End-to-End Voice Model with Roleplay-Specific RLHF and Paralinguistic Comprehension

By Ricardo, May 24, 2026

StepFun, the Shanghai-based AI lab, has launched StepAudio 2.5 Realtime, an end-to-end real-time speech large language model with fully customizable persona capabilities.

StepAudio 2.5 Realtime is a voice model that operates in real time. Unlike pipeline-based systems that split speech recognition, reasoning, and synthesis into sequential steps, it is an end-to-end model: audio goes in and audio comes out through a single unified system. The model supports Chinese and English.

It connects via a WebSocket API. The endpoint is wss://api.stepfun.com/v1/realtime, using the model string step-2.5-realtime.
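As a rough sketch of what a client needs before opening the connection, the snippet below assembles the URL and headers in Python. Only the endpoint and the model string come from the announcement; passing the model as a `model` query parameter, the Bearer-token `Authorization` header, and the `STEPFUN_API_KEY` variable name are assumptions for illustration and should be checked against StepFun's API documentation.

```python
import os
from urllib.parse import urlencode

# Both values are taken from the announcement.
REALTIME_ENDPOINT = "wss://api.stepfun.com/v1/realtime"
MODEL_NAME = "step-2.5-realtime"


def build_realtime_request(api_key: str) -> tuple[str, dict[str, str]]:
    """Return (url, headers) for the WebSocket handshake.

    ASSUMPTION: the model is selected with a `model` query parameter and
    the key is sent as a Bearer token; the article confirms neither.
    """
    url = f"{REALTIME_ENDPOINT}?{urlencode({'model': MODEL_NAME})}"
    headers = {"Authorization": f"Bearer {api_key}"}
    return url, headers


if __name__ == "__main__":
    # STEPFUN_API_KEY is a hypothetical environment variable name.
    key = os.environ.get("STEPFUN_API_KEY", "sk-placeholder")
    url, headers = build_realtime_request(key)
    print(url)
```

The resulting URL and headers can be handed to any WebSocket client library; the message format exchanged after the handshake is not covered by the article and is left out here.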

The Three Technical Pillars

StepFun's research team describes three core architectural improvements behind the model:

1. Million-Scale Persona Data Augmentation

Starting from 10,000+ high-quality, natively authored personas, StepFun applied algorithmic augmentation to build a million-scale persona feature matrix. This was combined with hundreds of thousands of real-world conversational samples for training. The intent is generalization: specifically, stable performance on difficult, long-tail conversational topics.

Instead of manually labeling hundreds of thousands of persona samples, the StepFun team used algorithmic expansion from a curated seed set.
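The announcement does not describe the augmentation algorithm itself. One simple way to picture expansion from a seed set is to recombine attribute values drawn from the seed personas, which is what the sketch below does. The attribute names (`role`, `tone`, `backstory`) and the sample values are hypothetical, and this illustrates combinatorial growth in general, not StepFun's method.

```python
from itertools import product

# Hypothetical seed personas; the real seed set has 10,000+ authored entries.
seed_personas = [
    {"role": "ship captain", "tone": "gruff", "backstory": "retired navy"},
    {"role": "tea master", "tone": "serene", "backstory": "mountain village"},
    {"role": "detective", "tone": "wry", "backstory": "big-city beat"},
]


def augment(seeds: list[dict[str, str]]) -> list[dict[str, str]]:
    """Recombine attribute values across seeds into new persona candidates."""
    keys = sorted(seeds[0])
    # Collect the distinct values seen for each attribute, preserving order.
    values = [list(dict.fromkeys(s[k] for s in seeds)) for k in keys]
    return [dict(zip(keys, combo)) for combo in product(*values)]


personas = augment(seed_personas)
print(len(seed_personas), "seeds ->", len(personas), "candidates")
```

Growth is multiplicative in the number of attributes, which is how a curated seed set can plausibly be expanded by orders of magnitude. A real pipeline would also need to filter out incoherent combinations.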

2. Roleplay-Specific RLHF Alignment

A known failure mode in conversational AI is “out-of-character” (OOC) behavior, in which a model drifts away from its defined persona mid-conversation. The StepFun team carried out dedicated RLHF (Reinforcement Learning from Human Feedback) optimization specifically for persona consistency in roleplay scenarios. RLHF is a training technique in which human preference signals are used to train a reward model, which then guides the language model's behavior. Applying it specifically to roleplay stability is a targeted design choice.
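The reward-model step is commonly trained with a pairwise (Bradley-Terry) preference loss: the model should score the response humans preferred above the one they rejected. The sketch below computes that loss for scalar rewards. It shows the standard formulation only; the article does not say which objective StepFun used, and the example reward values are made up.

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise loss: -log(sigmoid(reward_chosen - reward_rejected))."""
    margin = reward_chosen - reward_rejected
    # log1p(exp(-margin)) equals -log(sigmoid(margin)); branch for stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical scores: an in-character reply vs. one that breaks persona.
print(preference_loss(2.0, -1.0))  # small loss: ranking already correct
print(preference_loss(-1.0, 2.0))  # large loss: ranking is wrong
```

Minimizing this loss over many human-labeled pairs pushes the reward model to rank persona-consistent replies higher, and that reward signal is what the policy is then optimized against.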

3. Unified Speech Understanding and Generation

StepAudio 2.5 Realtime inherits the StepAudio 2.5 TTS capabilities and deeply fuses speech understanding and generation through reinforcement learning. This enables what StepFun calls “global scene-level tonal setting” and “intra-sentence detail sculpting.” The model can set an overall emotional register for a response while adjusting finer acoustic details within individual sentences.

Paralinguistic Understanding

A technically distinct area of this model is paralinguistic perception. Paralinguistics refers to non-verbal acoustic information in speech, such as tone, speaking rate, pauses, sighs, and laughter. By analyzing these elements, the model can perceive the user’s mood and underlying intentions. For example, it can identify fatigue from a low tone or frustration from a rapid speech rate. Capturing these signals requires the model to operate on audio features rather than transcribed text alone.

StepAudio 2.5 Realtime scored 82.18 on the paralinguistic comprehension benchmark, demonstrating perception of vocal pace, emotion, age, and other acoustic features.
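To make concrete why such cues are invisible in a transcript, the sketch below derives two simple timing features, speaking rate and pause ratio, from word-level timestamps. This is an illustration only: the model is described as working on audio end to end rather than on hand-built features, and the timestamps here are invented.

```python
def timing_features(words: list[tuple[float, float]]) -> dict[str, float]:
    """Compute speaking rate and pause ratio from (start, end) word times in seconds."""
    total = words[-1][1] - words[0][0]  # utterance duration
    voiced = sum(end - start for start, end in words)
    return {
        "words_per_second": len(words) / total,
        "pause_ratio": (total - voiced) / total,  # share of time spent silent
    }


# The same four-word sentence spoken two ways (hypothetical timings).
rushed = [(0.0, 0.2), (0.2, 0.4), (0.4, 0.6), (0.6, 0.8)]
weary = [(0.0, 0.5), (1.0, 1.5), (2.0, 2.5), (3.0, 4.0)]

print(timing_features(rushed))
print(timing_features(weary))
```

Both utterances would transcribe to identical text, yet their timing profiles differ sharply, which is the kind of information a text-only pipeline discards.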

https://stepaudiollm.github.io/step-audio-2.5-realtime/

Benchmark Results

StepFun's research team conducted a comprehensive suite of subjective and objective evaluations, benchmarking StepAudio 2.5 Realtime against leading real-time voice models across five dimensions.

Human evaluation is conducted through real mobile app conversations scored by human raters. The scores:

Human evaluation (subjective): 80.41
General dialogue (objective): 86.36
Automotive scenario (objective): 84.80
Spoken QA, covering 11 audio understanding tasks (objective): 79.80
Paralinguistic comprehension (objective): 82.18

Key Takeaways

StepAudio 2.5 Realtime is an end-to-end real-time speech LLM, launched by Shanghai-based StepFun.
It uses persona-specific RLHF and million-scale data augmentation to maintain stable character consistency.
The model ranked first across all five benchmark dimensions, tested in April 2026.
Paralinguistic comprehension (perceiving tone, rate, and emotion from audio) is a core technical differentiator.
API access is via WebSocket at wss://api.stepfun.com/v1/realtime with the model string step-2.5-realtime.


The post StepFun Releases StepAudio 2.5 Realtime: An End-to-End Voice Model with Roleplay-Specific RLHF and Paralinguistic Comprehension appeared first on MarkTechPost.
