StepFun AI Releases Step-Audio-EditX: A New Open-Source 3B LLM-Grade Audio Editing Model Excelling at Expressive and Iterative Audio Editing
How can speech editing become as direct and controllable as simply rewriting a line of text? StepFun AI has open sourced Step-Audio-EditX, a 3B parameter LLM based audio model that turns expressive speech editing into a token level, text like operation, instead of a waveform level signal processing task.

Why do developers care about controllable TTS?
Most zero shot TTS systems copy emotion, style, accent, and timbre directly from a short reference audio. They can sound natural, but control is weak. Style prompts in text help only for in-domain voices, and the cloned voice often ignores the requested emotion or speaking style.
Prior work tries to disentangle these factors with extra encoders, adversarial losses, or complex architectures. Step-Audio-EditX keeps a relatively entangled representation and instead changes the data and the post training objective. The model learns control by seeing many pairs and triplets where the text is fixed but one attribute changes with a large margin.
Architecture: dual codebook tokenizer plus compact audio LLM
Step-Audio-EditX reuses the Step-Audio dual codebook tokenizer. Speech is mapped into two token streams: a linguistic stream at 16.7 Hz with a 1024 entry codebook, and a semantic stream at 25 Hz with a 4096 entry codebook. Tokens are interleaved in a 2 to 3 ratio. The tokenizer retains prosody and emotion information, so it is not fully disentangled.
On top of this tokenizer, the StepFun research team builds a 3B parameter audio LLM. The model is initialized from a text LLM, then trained on a mixed corpus with a 1 to 1 ratio of pure text and dual codebook audio tokens in chat style prompts. The audio LLM reads text tokens, audio tokens, or both, and always generates dual codebook audio tokens as output.
A separate audio decoder handles reconstruction. A diffusion transformer based flow matching module predicts Mel spectrograms from audio tokens, reference audio, and a speaker embedding, and a BigVGANv2 vocoder converts the Mel spectrograms to a waveform. The flow matching module is trained on about 200,000 hours of high quality speech, which improves pronunciation and timbre similarity.
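To make the interleaving concrete, here is a minimal sketch of merging the two token streams at a fixed 2:3 ratio (2 linguistic tokens for every 3 semantic tokens, matching the 16.7 Hz versus 25 Hz frame rates). The function name and token values are illustrative assumptions, not the real tokenizer API.

```python
def interleave_dual_codebook(linguistic, semantic):
    """Merge a 16.7 Hz linguistic stream and a 25 Hz semantic stream
    into one sequence using a 2:3 interleaving pattern."""
    assert 3 * len(linguistic) == 2 * len(semantic), "streams must align 2:3"
    out, li, si = [], 0, 0
    while li < len(linguistic):
        out.extend(linguistic[li:li + 2])  # 2 linguistic tokens
        out.extend(semantic[si:si + 3])    # 3 semantic tokens
        li += 2
        si += 3
    return out

# A short aligned snippet (4 linguistic, 6 semantic tokens) for readability;
# one real second of audio would carry roughly 17 and 25 tokens.
merged = interleave_dual_codebook([10, 11, 12, 13],
                                  [500, 501, 502, 503, 504, 505])
print(merged)  # [10, 11, 500, 501, 502, 12, 13, 503, 504, 505]
```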
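The two-stage decoder can be pictured as a simple pipeline: tokens to Mel spectrogram, then Mel spectrogram to waveform. The stubs below only trace data flow and shapes; every function name, upsampling factor, and dimension here is an assumption for illustration, not the real model interface.

```python
import numpy as np

def flow_matching_decoder(audio_tokens, reference_audio, speaker_embedding):
    """Stub for the diffusion transformer flow matching module: predicts
    a Mel spectrogram (frames x 80 bins) from tokens plus conditioning."""
    n_frames = len(audio_tokens) * 4          # made-up upsampling factor
    return np.zeros((n_frames, 80))

def bigvgan_vocoder(mel, hop_length=256):
    """Stub for the BigVGANv2 vocoder: Mel spectrogram to waveform."""
    return np.zeros(mel.shape[0] * hop_length)

mel = flow_matching_decoder(audio_tokens=list(range(50)),
                            reference_audio=None,
                            speaker_embedding=np.zeros(192))
wav = bigvgan_vocoder(mel)
print(mel.shape, wav.shape)  # (200, 80) (51200,)
```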

Large margin synthetic data instead of complicated encoders
The key idea is large margin learning. The model is post trained on triplets and quadruplets that keep the text fixed and change only one attribute, with a clear gap.
For zero shot TTS, Step-Audio-EditX uses a high quality in-house dataset, primarily Chinese and English, with a small amount of Cantonese and Sichuanese, and about 60,000 speakers. The data covers wide intra speaker and inter speaker variation in style and emotion.
For emotion and speaking style editing, the team builds synthetic large margin triplets (text, neutral audio, emotion or style audio). Voice actors record roughly 10 second clips for each emotion and style. StepTTS zero shot cloning then produces neutral and emotional versions of the same text for the same speaker. A margin scoring model, trained on a small human labeled set, scores pairs on a 1 to 10 scale, and only samples with a score of at least 6 are kept.
Paralinguistic editing, which covers breathing, laughter, filled pauses, and other tags, uses a semi synthetic strategy on top of the NVSpeech dataset. The research team builds quadruplets where the target is the original NVSpeech audio and transcript, and the input is a cloned version with the tags removed from the text. This provides time domain editing supervision without a margin model.
Reinforcement learning data uses two preference sources. Human annotators rate 20 candidates per prompt on a 5 point scale for correctness, prosody, and naturalness, and pairs with a margin greater than 3 are kept. A comprehension model scores emotion and speaking style on a 1 to 10 scale, and pairs with a margin greater than 8 are kept.
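The margin filtering step amounts to a simple threshold over scored triplets. The sketch below assumes triplets stored as dicts with a `margin_score` field from the scoring model; the field names and data layout are illustrative, only the 1 to 10 scale and the threshold of 6 come from the description above.

```python
def filter_large_margin(triplets, min_score=6):
    """Keep only triplets whose margin score (1-10) is at least min_score."""
    return [t for t in triplets if t["margin_score"] >= min_score]

data = [
    {"text": "hello", "margin_score": 8},  # clearly styled pair, kept
    {"text": "hello", "margin_score": 4},  # margin too small, dropped
    {"text": "world", "margin_score": 6},  # exactly at threshold, kept
]
print(len(filter_large_margin(data)))  # 2
```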
Post training: SFT plus PPO on token sequences
Post training has two stages, supervised fine tuning followed by PPO.
In supervised fine tuning, system prompts define zero shot TTS and editing tasks in a unified chat format. For TTS, the prompt waveform is encoded to dual codebook tokens, converted to string form, and inserted into the system prompt as speaker information. The user message is the target text, and the model returns new audio tokens. For editing, the user message includes the original audio tokens plus a natural language instruction, and the model outputs edited tokens.
Reinforcement learning then refines instruction following. A 3B reward model is initialized from the SFT checkpoint and trained with a Bradley-Terry loss on large margin preference pairs. The reward is computed directly on dual codebook token sequences, without decoding to waveform. PPO training uses this reward model, a clip threshold, and a KL penalty to balance quality against deviation from the SFT policy.
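The unified chat format might look roughly like the following. The role names and the convention of embedding audio tokens as a string are assumptions for illustration; the real prompt template may differ.

```python
def make_tts_prompt(speaker_tokens, target_text):
    """Zero-shot TTS: prompt audio tokens go into the system message
    as speaker information, the target text is the user message."""
    return [
        {"role": "system",
         "content": f"You are a TTS system. Speaker tokens: {speaker_tokens}"},
        {"role": "user", "content": target_text},
    ]

def make_edit_prompt(source_tokens, instruction):
    """Editing: original audio tokens plus a natural language instruction."""
    return [
        {"role": "system", "content": "You are an audio editing system."},
        {"role": "user",
         "content": f"Audio: {source_tokens}\nInstruction: {instruction}"},
    ]

msgs = make_edit_prompt("<a12><a873><a45>", "Make the speaker sound happier.")
print(msgs[1]["content"])
```

In both tasks the model's reply would itself be a string of dual codebook audio tokens, which the decoder then turns into a waveform.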
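The Bradley-Terry objective itself is compact: the reward model should score the chosen sequence above the rejected one. A minimal scalar sketch, using plain Python math rather than the batched tensor version used in training:

```python
import math

def bradley_terry_loss(reward_chosen, reward_rejected):
    """Negative log-likelihood that the chosen sequence beats the
    rejected one: -log(sigmoid(r_chosen - r_rejected))."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A well-separated pair yields a small loss, a reversed pair a large one.
print(round(bradley_terry_loss(3.0, 1.0), 4))  # 0.1269
print(round(bradley_terry_loss(1.0, 3.0), 4))  # 2.1269
```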
Step-Audio-Edit-Test: iterative editing and generalization
To quantify control, the research team introduces Step-Audio-Edit-Test. It uses Gemini 2.5 Pro as an LLM judge to evaluate emotion, speaking style, and paralinguistic accuracy. The benchmark has 8 speakers, drawn from WenetSpeech4TTS, GLOBE V2, and Libri-Light, with 4 speakers per language.
The emotion set has 5 categories with 50 Chinese and 50 English prompts per category. The speaking style set has 7 styles with 50 prompts per language per style. The paralinguistic set has 10 labels, such as breathing, laughter, surprise-oh, and uhm, with 50 prompts per label and language.
Editing is evaluated iteratively. Iteration 0 is the initial zero shot clone. The model then applies 3 rounds of editing with text instructions. In Chinese, emotion accuracy rises from 57.0 at iteration 0 to 77.7 at iteration 3. Speaking style accuracy rises from 41.6 to 69.2. English shows similar behavior, and a prompt-fixed ablation, where the same prompt audio is used for all iterations, still improves accuracy, which supports the large margin learning hypothesis.
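The iterative protocol can be sketched as a simple loop that feeds each edit back in as the next input. Here `edit_audio` is a placeholder standing in for a call to the Step-Audio-EditX model; its signature is an assumption for illustration.

```python
def edit_audio(tokens, instruction):
    # Placeholder: the real model maps (audio tokens, instruction)
    # to edited audio tokens. Here we just tag each round.
    return tokens + ["<edited>"]

def iterative_edit(initial_clone_tokens, instruction, rounds=3):
    """Iteration 0 is the zero-shot clone; each later iteration
    re-edits the previous output with the same text instruction."""
    history = [initial_clone_tokens]
    tokens = initial_clone_tokens
    for _ in range(rounds):
        tokens = edit_audio(tokens, instruction)
        history.append(tokens)
    return history

history = iterative_edit(["<a1>", "<a2>"], "Make the speaker sound happier.")
print(len(history))  # 4 states: iteration 0 plus 3 edits
```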

The same editing model is applied to 4 closed source TTS systems: GPT-4o mini TTS, ElevenLabs v2, Doubao Seed TTS 2.0, and MiniMax speech 2.6 hd. For all of them, one editing iteration with Step-Audio-EditX improves both emotion and style accuracy, and further iterations continue to help.
Paralinguistic editing is scored on a 1 to 3 scale. The average score rises from 1.91 at iteration 0 to 2.89 after a single edit, in both Chinese and English, which is comparable to native paralinguistic synthesis in strong commercial systems.

Key Takeaways
- Step-Audio-EditX uses a dual codebook tokenizer and a 3B parameter audio LLM, so it can treat speech as discrete tokens and edit audio in a text like way.
- The model relies on large margin synthetic data for emotion, speaking style, paralinguistic cues, speed, and noise, rather than adding extra disentangling encoders.
- Supervised fine tuning plus PPO with a token level reward model aligns the audio LLM to follow natural language editing instructions for both TTS and editing tasks.
- The Step-Audio-Edit-Test benchmark, with Gemini 2.5 Pro as judge, shows clear accuracy gains over 3 editing iterations for emotion, style, and paralinguistic control in both Chinese and English.
- Step-Audio-EditX can post process and improve speech from closed source TTS systems, and the full stack, including code and checkpoints, is available as open source for developers.
Editorial Comments
Step-Audio-EditX is a precise step forward in controllable speech synthesis, because it keeps the Step-Audio tokenizer, adds a compact 3B audio LLM, and optimizes control through large margin data and PPO. The introduction of Step-Audio-Edit-Test with Gemini 2.5 Pro as judge makes the evaluation story concrete for emotion, speaking style, and paralinguistic control, and the open release lowers the barrier for practical audio editing research. Overall, this release makes audio editing feel much closer to text editing.
Check out the Paper, Repo, and Model Weights.
The post StepFun AI Releases Step-Audio-EditX: A New Open-Source 3B LLM-Grade Audio Editing Model Excelling at Expressive and Iterative Audio Editing appeared first on MarkTechPost.
