Stability AI Releases Stable Audio 3: A Family of Fast Latent Diffusion Models for Audio Generation and Editing
Stability AI has released open weights for Stable Audio 3 together with a technical research paper. Stable Audio 3 is a family of latent diffusion models that generate stereo audio at 44.1 kHz. The models support variable-length outputs, inpainting-based editing, and fast inference.
What Is Stable Audio 3?
Stable Audio 3 is a family of three model scales: small, medium, and large. A latent diffusion model generates audio by learning to progressively remove noise from a compressed representation of audio, called a latent. The model learns a mapping from noise to data by training on many (noisy latent, audio) pairs.
The three model scales differ in capacity and maximum generation length. All parameter counts below are for the diffusion transformer component only. Each model also includes a SAME autoencoder (108M parameters for SAME-S, 852M for SAME-L).
- small-music: 459M diffusion transformer parameters, up to 2 minutes, music only.
- small-sfx: 459M diffusion transformer parameters, up to 2 minutes, sound effects only.
- medium: 1.4B diffusion transformer parameters, up to 6 minutes and 20 seconds, music and sound effects.
- large: 2.7B diffusion transformer parameters, up to 6 minutes and 20 seconds, music and sound effects.
Open weights for small and medium are available on Hugging Face. Large is available under an enterprise license.
Architecture: Two Components
Stable Audio 3 has two main components: a semantic-acoustic autoencoder called SAME, and a diffusion transformer that generates latent sequences conditioned on text, duration, and inpainting masks.

The SAME Autoencoder
SAME (Semantically-Aligned Music autoEncoder) converts stereo 44.1 kHz audio into a compact latent representation and back. Its key design parameter is a 4096× downsampling ratio, significantly higher than the 1024× to 2048× ratios common in prior audio autoencoders. This higher ratio reduces latent sequence lengths enough for long-form generation to run on consumer hardware.
SAME achieves its 4096× compression in two stages. First, a patching stage reshapes stereo audio into non-overlapping patches of 256 samples per channel, achieving 256× downsampling. Second, a Transformer Resampling Block (TRB) applies a further 16× downsampling using learnable output embeddings interleaved with the input sequence and processed by a transformer. The combined output is a 256-dimensional latent sequence at roughly 10.76 Hz for a 44.1 kHz input.
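The compression arithmetic above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the constants come from the article, while the helper names and the choice to round the frame count up are assumptions made for illustration.

```python
import math

SAMPLE_RATE_HZ = 44_100   # input sample rate quoted in the article
PATCH_SIZE = 256          # stage 1: samples per channel in each patch
TRB_FACTOR = 16           # stage 2: extra downsampling in the resampling block


def total_downsampling() -> int:
    """Combined downsampling ratio of the two stages."""
    return PATCH_SIZE * TRB_FACTOR


def latent_rate_hz() -> float:
    """Latent frames produced per second of audio."""
    return SAMPLE_RATE_HZ / total_downsampling()


def latent_frames(seconds: float) -> int:
    """Latent sequence length for a clip (rounding up is an assumption)."""
    return math.ceil(seconds * SAMPLE_RATE_HZ / total_downsampling())
```

Under these assumptions, `latent_frames(380)` comes to about 4,092 frames for the longest supported clip, which is the sequence length the diffusion transformer has to process.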
The SAME autoencoder is trained with five loss types: spectral reconstruction, adversarial, diffusion alignment, semantic regression (predicting chroma and interaural level difference), and contrastive latent alignment. These losses push the latent to preserve both acoustic reconstruction quality and semantic structure. A soft-normalisation bottleneck constrains the scale of the latent, providing deterministic encoding.
The SAME autoencoder is frozen during diffusion training. Small models use SAME-S (108M parameters, optimized for CPU inference); medium and large use SAME-L (852M parameters).
The Diffusion Transformer
The diffusion transformer operates on SAME latents. Conditioning enters through three pathways:
- Text: a frozen T5Gemma encoder produces a sequence of 256 embeddings of dimension 768. Short prompts are padded to 256 with a learned embedding; long prompts are truncated.
- Duration: encoded as a Fourier features vector and injected via both Adaptive Layer Normalization (AdaLN) and cross-attention alongside the text prompt.
- Inpainting: a binary mask concatenated with the masked reference audio is projected through a 2-layer MLP and added to the residual stream of each transformer block.
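The fixed-length text pathway can be illustrated with a small helper. This is a minimal sketch, not the model's code: `pad_or_truncate` is a hypothetical name, embeddings are plain Python lists, and the padding vector stands in for the learned padding embedding described above.

```python
def pad_or_truncate(embeddings, pad_embedding, length=256):
    """Return exactly `length` rows: truncate long inputs, pad short ones.

    `embeddings` is a list of per-token vectors; `pad_embedding` is a
    placeholder for the learned padding vector used by the real model.
    """
    rows = list(embeddings[:length])                      # truncate if too long
    rows.extend([pad_embedding] * (length - len(rows)))   # pad if too short
    return rows
```

A three-token prompt would come back as 3 real rows followed by 253 copies of the padding vector, so the cross-attention layers always see 256 positions.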
Each transformer block contains self-attention, cross-attention, local-additive conditioning for inpainting, and a SwiGLU feed-forward network. Medium and large use differential attention, which computes two separate attention maps using two (Q, K) pairs sharing one set of values V, then subtracts one map from the other. This cancels attention patterns that are common to both maps. The transformer prepends 64 learnable memory embeddings before processing each sequence. These provide a global context buffer that every position can attend to, and are removed before computing any loss.
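A single-head NumPy sketch of the differential attention idea follows. It is illustrative only: the function names are hypothetical, and the weight `lam` on the second map is an assumption (the article says only that one map is subtracted from the other; `lam=1.0` reproduces that literally, while published differential attention variants typically learn this scale).

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def differential_attention(q1, k1, q2, k2, v, lam=1.0):
    """Two attention maps from two (Q, K) pairs, sharing one V.

    q*, k*: arrays of shape (seq, d); v: array of shape (seq, d_v).
    `lam` scales the subtracted map (an assumption, see lead-in).
    """
    d = q1.shape[-1]
    map1 = softmax(q1 @ k1.T / np.sqrt(d))
    map2 = softmax(q2 @ k2.T / np.sqrt(d))
    # Patterns present in both maps cancel in the difference.
    return (map1 - lam * map2) @ v
```

If both (Q, K) pairs are identical and `lam` is 1, the two maps cancel completely and the output is zero, which is the limiting case of "patterns common to both maps are removed".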
Variable-Length Generation
Most prior latent diffusion models for audio operate at a fixed maximum sequence length. Generating a short clip still requires running inference at full length, wasting compute on silence. Stable Audio 3 is trained to generate audio at variable lengths natively, using three mechanisms:
- Variable-length flash attention and masked loss: sequences shorter than the batch maximum are right-padded in latent space. Padding positions are excluded from self-attention and from the loss.
- Per-element timestep shifts: longer sequences retain more structure at a given noise level due to redundancy between neighboring elements. To compensate, the noise schedule is shifted toward higher noise levels for longer sequences during training, using a logistic shift parameterized by µ (interpolating between µmin=0.5 and µmax=1.15 based on sequence length).
- Silence augmentation: the signal region is randomly extended with pre-computed silence embeddings, with the extension length drawn from an exponential distribution averaging 4 seconds. This teaches the model to terminate audio with natural silence.
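The first two mechanisms can be sketched in NumPy. Several details here are assumptions rather than facts from the paper: µ is interpolated linearly in sequence length, the shift is applied by adding µ in logit space (a form used in other flow-matching codebases), and t=1 is taken to mean pure noise. Only the µ range and the idea of excluding padding from the loss come from the article.

```python
import numpy as np

MU_MIN, MU_MAX = 0.5, 1.15   # range quoted in the article


def mu_for_length(n_frames, max_frames):
    """Interpolate mu by sequence length (linear interpolation is an assumption)."""
    frac = min(max(n_frames / max_frames, 0.0), 1.0)
    return MU_MIN + frac * (MU_MAX - MU_MIN)


def shift_timestep(t, mu):
    """Logistic shift: add mu in logit space, with t in (0, 1) and t=1 as noise."""
    logit = np.log(t) - np.log1p(-t)
    return 1.0 / (1.0 + np.exp(-(logit + mu)))


def masked_mse(pred, target, lengths):
    """Mean squared error over valid frames only.

    pred, target: (batch, frames, channels); lengths: valid frames per item.
    Right-padded frames beyond each length contribute nothing to the loss.
    """
    frames = pred.shape[1]
    mask = np.arange(frames)[None, :] < np.asarray(lengths)[:, None]  # (batch, frames)
    sq_err = ((pred - target) ** 2).mean(axis=-1)                     # (batch, frames)
    return (sq_err * mask).sum() / mask.sum()
```

With this form, a larger µ moves any given timestep closer to 1, so longer sequences are trained at higher noise levels, matching the direction described in the list above.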
The practical result is that inference cost scales with output duration. Medium generates 20 seconds of audio in roughly 0.62 seconds on an H200. Generating 380 seconds takes 1.31 seconds on the same hardware.
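To put those two timings in perspective, the short calculation below derives a real-time factor and compares how audio length and wall-clock time grow between the two data points. The timings are the ones quoted above; the helper names are illustrative.

```python
# Medium model on an H200, as quoted in the article:
# seconds of audio generated -> seconds of wall-clock time.
MEDIUM_H200_TIMINGS = {20: 0.62, 380: 1.31}


def realtime_factor(audio_seconds, wall_seconds):
    """Seconds of audio produced per second of compute."""
    return audio_seconds / wall_seconds


def growth(timings):
    """Return (audio length ratio, wall-clock ratio) between shortest and longest clip."""
    shortest, longest = min(timings), max(timings)
    return longest / shortest, timings[longest] / timings[shortest]
```

On these figures the 380-second clip is 19 times longer than the 20-second clip but takes only a little over twice as long to generate, so the cost grows with duration but far less than proportionally at these lengths.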
Three-Stage Training Pipeline
Stage 1: Flow Matching Pre-Training. The model learns a velocity field that transports Gaussian noise toward audio latents. Training uses minibatch optimal transport coupling via Sinkhorn iterations, which pairs each data sample with the closest available noise vector in the batch. This straightens training trajectories and reduces crossing transport paths. Inpainting is trained jointly throughout: at each step, one of three mask types is sampled: full mask (80%, equivalent to unconditional generation), random segment mask (10%), or a causal prefix mask for continuation (10%).
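The mask sampling in Stage 1 can be sketched as follows. The 80/10/10 split comes from the article; everything else is an assumption for illustration, including the function name, the convention that 1 marks frames the model must generate, and how segment and prefix boundaries are drawn.

```python
import random


def sample_inpainting_mask(n_frames, rng):
    """Sample a mask type and a per-frame mask (1 = generate, 0 = keep reference).

    Requires n_frames >= 2 so that a prefix mask keeps at least one frame
    and generates at least one frame.
    """
    kind = rng.choices(["full", "segment", "prefix"], weights=[0.8, 0.1, 0.1])[0]
    if kind == "full":
        # Everything is generated: equivalent to unconditional generation.
        return kind, [1] * n_frames
    if kind == "segment":
        # A random contiguous region is regenerated.
        start = rng.randrange(n_frames)
        end = rng.randrange(start + 1, n_frames + 1)
        return kind, [1 if start <= i < end else 0 for i in range(n_frames)]
    # Causal prefix: keep the first `keep` frames, generate the rest.
    keep = rng.randrange(1, n_frames)
    return kind, [0] * keep + [1] * (n_frames - keep)
```

Calling `sample_inpainting_mask(64, random.Random(0))` repeatedly should return a full mask roughly four times out of five, with the remaining draws split between segment and prefix masks.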
Stage 2: Distillation Warmup. A frozen copy of the flow matching model (the teacher) generates 15-step DPM++ trajectories with CFG scale 5. The student is trained for 10,000 steps to map any intermediate noisy state directly to the teacher's final denoised output in a single step, using an MSE loss. This collapses the multi-step ODE into a single-step denoiser. The trade-off is that MSE regression produces outputs that regress toward the conditional mean, reducing fine-grained detail.
Stage 3: Adversarial Post-Training. This stage replaces the MSE objective with a relativistic adversarial setup. A discriminator (initialized from the base flow matching model) evaluates the student's one-step denoised outputs directly against real data. The teacher is discarded entirely at this stage. The generator is trained with two losses: a relativistic adversarial loss (L_R) and a CLAP alignment loss (L_CLAP). The discriminator is trained with L_R and a contrastive loss (L_C) that penalizes the discriminator for ignoring text-audio alignment (it is trained to distinguish correctly paired audio-text pairs from shuffled ones). The adversarial setup allows the model to recover the perceptual sharpness that MSE distillation removes.
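For readers unfamiliar with relativistic adversarial losses, the sketch below shows one common paired formulation, in which the discriminator is scored on the difference between its outputs for a real and a generated sample. Whether L_R in the paper has exactly this form is an assumption; the function names are illustrative, and the CLAP and contrastive terms are not shown.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def relativistic_d_loss(real_logits, fake_logits):
    """Discriminator wants each real sample to score above its paired fake."""
    pairs = list(zip(real_logits, fake_logits))
    return sum(softplus(-(r - f)) for r, f in pairs) / len(pairs)


def relativistic_g_loss(real_logits, fake_logits):
    """Generator wants each fake sample to score above its paired real one."""
    pairs = list(zip(real_logits, fake_logits))
    return sum(softplus(-(f - r)) for r, f in pairs) / len(pairs)
```

Because only the difference between real and generated scores matters, the generator is pushed to make its outputs look at least as real as the data they are compared with, rather than simply "real" in isolation.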
Inference: Ping-Pong Sampling and No CFG
The post-trained model can generate audio in a single forward pass. However, single-step generation from pure noise remains difficult. Stable Audio 3 uses ping-pong sampling at inference: the model denoises to a clean estimate, then adds new noise at a reduced level, then denoises again. This repeats for 8 steps using a logSNR-uniform schedule (N+1 equally-spaced steps in the interval [λmin, λmax] = [−6.2, 2.0]). The iterative denoise-then-renoise schedule allows each step to correct errors from the previous step.
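A toy version of the sampling loop is sketched below. The logSNR range and step count come from the article; the rest is assumed for illustration: the interpolation convention x_t = (1 − t)·clean + t·noise (which gives logSNR = 2·log((1 − t)/t)), the `denoise(x, t)` callable standing in for the one-step model, and returning the last clean estimate without using the final schedule point for renoising.

```python
import numpy as np

LAMBDA_MIN, LAMBDA_MAX = -6.2, 2.0   # logSNR range quoted in the article


def logsnr_uniform_schedule(n_steps=8):
    """N+1 noise levels t, equally spaced in logSNR, from noisiest to cleanest."""
    lambdas = np.linspace(LAMBDA_MIN, LAMBDA_MAX, n_steps + 1)
    # Assumed convention: logSNR = 2 * log((1 - t) / t), solved for t.
    return 1.0 / (1.0 + np.exp(lambdas / 2.0))


def ping_pong_sample(denoise, shape, n_steps=8, seed=0):
    """Alternate one-step denoising with renoising at a lower noise level."""
    rng = np.random.default_rng(seed)
    ts = logsnr_uniform_schedule(n_steps)
    x = rng.standard_normal(shape)           # start from pure noise
    clean = x
    for i in range(n_steps):
        clean = denoise(x, ts[i])            # "ping": jump to a clean estimate
        if i < n_steps - 1:
            t_next = ts[i + 1]               # "pong": renoise at a reduced level
            x = (1.0 - t_next) * clean + t_next * rng.standard_normal(shape)
    return clean
```

Each pass sees a fresh noisy version of the previous estimate, which is what gives later steps the chance to repair earlier mistakes.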
Stable Audio 3 does not require classifier-free guidance (CFG) at inference. Standard diffusion models run two forward passes per step, one conditional and one unconditional, and combine the two predictions. Here, CFG quality gains are internalized during distillation warmup, where the student is trained to match CFG-enhanced teacher trajectories. Text-audio alignment is further reinforced through L_CLAP during adversarial post-training. This eliminates the two-pass-per-step cost of CFG.
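For context, the standard CFG combination that Stable Audio 3 avoids at inference looks like this. The formula is the usual one from the classifier-free guidance literature; the function names and the list-based vectors are illustrative only.

```python
def cfg_combine(cond, uncond, scale):
    """Standard CFG: move from the unconditional prediction toward the
    conditional one, overshooting when scale > 1."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def forward_passes(n_steps, use_cfg):
    """Model evaluations per sample: CFG doubles the count."""
    return n_steps * (2 if use_cfg else 1)
```

With 8 sampling steps, a CFG sampler would need 16 model evaluations where a guidance-free sampler needs 8.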
Prompt formatting note: All Stable Audio 3 models trained on AudioSparx (small-music, medium, large) require prompt prefixes to function correctly. Music prompts should be prepended with "TrackType: Music, VocalType: Instrumental," and sound effects prompts with "TrackType: SFX,".
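A small helper makes the prefix rule hard to forget. The two prefix strings are taken from the note above; the function name, the `kind` argument, and the single space between prefix and description are assumptions, not part of any official API.

```python
MUSIC_PREFIX = "TrackType: Music, VocalType: Instrumental,"
SFX_PREFIX = "TrackType: SFX,"


def format_prompt(description, kind):
    """Prepend the required prefix for a music or sound effects prompt."""
    prefixes = {"music": MUSIC_PREFIX, "sfx": SFX_PREFIX}
    if kind not in prefixes:
        raise ValueError(f"kind must be 'music' or 'sfx', got {kind!r}")
    # Joining with a single space is an assumption about the expected format.
    return f"{prefixes[kind]} {description.strip()}"
```

For example, `format_prompt("warm lo-fi piano", "music")` yields a prompt beginning with the music prefix followed by the description.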
Evaluation Results
Instrumental music (Song Describer Dataset, 120s). On FAD (lower is better) and CLAP score (higher is better), large achieves FAD 0.101 / CLAP 0.393. Medium achieves FAD 0.107 / CLAP 0.390. Stable Audio 2.5 (the internal prior-generation baseline) achieves FAD 0.106 / CLAP 0.395. In the listening test, medium and large score higher on musicality (MUS) than Stable Audio 2.5 (4.15 and 4.30 vs. 3.70 out of 5, respectively). Inference time for 120s audio on an H200: 0.45s for small-music, 0.78s for medium, 0.81s for large. Stable Audio 2.5 takes 0.85s for the same length.
Sound effects (BBC Sound Effects Dataset, 5s). Medium achieves FAD 0.369 / CLAP 0.369. The next-best open-weight baselines are Stable Audio Open Small (FAD 0.500 / CLAP 0.277) and Stable Audio Open (FAD 0.501 / CLAP 0.263). Woosh Flow scores FAD 0.580.
Audio editing (inpainting). The research team evaluates three inpainting settings: single region, two independent regions, and continuation. For music, medium achieves FAD-full of 0.046 on single inpainting and 0.046 on double inpainting. Large achieves 0.047 on both. For continuation, medium achieves FAD-full 0.074 and large achieves 0.071. Sound effects results follow a similar pattern; continuation shows higher FAD than inpainting in both domains, which the team attributes to the model having less surrounding audio context to anchor the generation.
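The two automatic metrics used throughout this section can be written down compactly. The sketch below assumes embeddings have already been extracted (the article reports LAION-CLAP embeddings); the function names are illustrative, and the real evaluation pipeline may differ in details such as how statistics are estimated.

```python
import numpy as np


def clap_score(text_emb, audio_emb):
    """Cosine similarity between a text embedding and an audio embedding."""
    text_emb = np.asarray(text_emb, dtype=float)
    audio_emb = np.asarray(audio_emb, dtype=float)
    return float(text_emb @ audio_emb
                 / (np.linalg.norm(text_emb) * np.linalg.norm(audio_emb)))


def frechet_distance(ref, gen):
    """Frechet distance between Gaussians fitted to two embedding sets.

    ref, gen: arrays of shape (n_clips, dim).
    Formula: |mu_r - mu_g|^2 + Tr(C_r + C_g - 2 (C_r C_g)^(1/2)).
    """
    mu_r, mu_g = ref.mean(axis=0), gen.mean(axis=0)
    cov_r = np.cov(ref, rowvar=False)
    cov_g = np.cov(gen, rowvar=False)
    # The trace of the matrix square root equals the sum of the square
    # roots of the eigenvalues of cov_r @ cov_g (real and non-negative
    # in exact arithmetic; clipped here to absorb rounding noise).
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = ((mu_r - mu_g) ** 2).sum()
    return float(mean_term + np.trace(cov_r) + np.trace(cov_g) - 2.0 * trace_sqrt)
```

Identical embedding sets give a distance of zero, and the distance grows as the generated clips drift from the reference set in either mean or covariance, which is why lower FAD is better.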
Comparison
| Model | Developer | Released | Architecture | Parameters | Max length | Sample rate | Domain | Open weights | Inpainting |
|---|---|---|---|---|---|---|---|---|---|
| STABLE AUDIO LINEAGE | | | | | | | | | |
| Stable Audio Open | Stability AI | Jul 2024 | Latent diffusion (DiT) | DiT 1057M + AE 156M + T5 109M | 47s | 44.1kHz stereo | Music + SFX | Yes | No |
| Stable Audio Open Small | Stability AI | 2024 | Latent diffusion (DiT) | Not published | 11s | 44.1kHz stereo | SFX | Yes | No |
| Stable Audio 2.5 | Stability AI | Internal | Latent diffusion (DiT) | Not published | 190s (3m 10s) | 44.1kHz stereo | Music | Not released | No |
| SA3 small-music ★ | Stability AI | May 2026 | Latent diffusion (SAME + DiT) | DT 459M + SAME-S 108M | 2m | 44.1kHz stereo | Music only | Yes | Yes |
| SA3 small-sfx ★ | Stability AI | May 2026 | Latent diffusion (SAME + DiT) | DT 459M + SAME-S 108M | 2m | 44.1kHz stereo | SFX only | Yes | Yes |
| SA3 medium ★ | Stability AI | May 2026 | Latent diffusion (SAME + DiT) | DT 1.4B + SAME-L 852M | 6m 20s | 44.1kHz stereo | Music + SFX | Yes | Yes |
| SA3 large ★ | Stability AI | May 2026 | Latent diffusion (SAME + DiT) | DT 2.7B + SAME-L 852M | 6m 20s | 44.1kHz stereo | Music + SFX | Enterprise | Yes |
| COMPETITORS | | | | | | | | | |
| TangoFlux | SUTD / NVIDIA / Lambda | Dec 2024 | Flow matching (DiT + MMDiT) | 515M | 30s | 44.1kHz | SFX | Yes (Apache 2.0) | No |
| Woosh Flow | Sony AI | Apr 2026 | Flow matching | Not published | 5s | Not disclosed | SFX | Yes (MIT) | No |
| Woosh DFlow | Sony AI | Apr 2026 | Distilled flow matching | Not published | 5s | Not disclosed | SFX | Yes (MIT) | No |
| DiffRhythm 2 | ASLP Lab (NPU) | Oct 2025 | Block flow matching (semi-autoregressive) | Not published | 210s (3m 30s) | 48kHz output | Music + vocals | Yes | No |
| ACE-Step 1.5 | ACE Studio / StepFun | Jan 2026 | Hybrid LM (0.6B–4B) + DiT (up to 4B) | LM 0.6B–4B + XL DiT 4B | 10m | Not disclosed | Music + vocals + lyrics | Yes | No |
★ SA3 rows: Parameter counts are for the diffusion transformer (DT) component only; SAME autoencoder parameters are listed separately. Total model size including SAME: small ~567M, medium ~2.25B, large ~3.55B.
Stable Audio 2.5 is an internal Stability AI model that has not been publicly released; it is included as the prior-generation internal baseline from the SA3 paper.
DiffRhythm 2 VAE processes 24kHz input audio and reconstructs at 48kHz (arXiv:2510.22950).
Evaluation setup: Song Describer Dataset (SDD), 120s instrumental music generations, H200 GPU. FAD uses LAION-CLAP embeddings (630k-audioset-best.pt). OVL/REL/MUS are mean opinion scores (1–5) from a 14-participant listening test. Source: SA3 paper Tables 3 and 4.
| Model | FAD ↓ | CLAP ↑ | OVL ↑ (1–5) | REL ↑ (1–5) | MUS ↑ (1–5) | Inference (H200) | Sampler / steps |
|---|---|---|---|---|---|---|---|
| COMPETITORS | | | | | | | |
| DiffRhythm 2 | 0.293 | 0.158 | 3.05 ± 0.94 | 2.10 ± 1.29 | 2.60 ± 1.10 | 3.88s | n/a |
| ACE-Step 1.5 xl-turbo | 0.193 | 0.321 | 3.35 ± 1.09 | 3.30 ± 1.13 | 3.15 ± 1.31 | 6.23s | n/a |
| STABILITY AI: PRIOR GENERATION | | | | | | | |
| Stable Audio 2.5 (internal) | 0.106 | 0.395 | 3.90 ± 0.79 | 4.30 ± 0.66 | 3.70 ± 0.92 | 0.85s | DPM++ 3M SDE, 8 steps, CFG 6 |
| STABLE AUDIO 3: POST-TRAINED (8 PING-PONG STEPS, NO CFG) | | | | | | | |
| SA3 small-music | 0.145 | 0.393 | 3.20 ± 0.89 | 3.60 ± 0.94 | 3.15 ± 0.81 | 0.45s | PingPong, 8 steps |
| SA3 medium | 0.107 | 0.390 | 4.20 ± 0.89 | 4.25 ± 0.85 | 4.15 ± 0.93 | 0.78s | PingPong, 8 steps |
| SA3 large | 0.101 | 0.393 | 3.95 ± 0.89 | 3.80 ± 1.11 | 4.30 ± 0.73 | 0.81s | PingPong, 8 steps |
FAD: Fréchet Audio Distance (lower is better). CLAP: cosine similarity between text and audio embeddings (higher is better).
OVL = overall production quality. REL = text relevance. MUS = musicality (melody/harmony coherence).
ACE-Step 1.5 and DiffRhythm 2 were evaluated with instrumental prompts only, for fair comparison with SA3 (instrumental-only models). SA3 base flow matching models (50 steps, CFG 7, Euler sampler) are not shown here; see SA3 paper Table 11 for that comparison.
Evaluation setup: BBC Sound Effects Dataset, ≤5s generations matched to reference duration, H200 GPU. FAD uses LAION-CLAP embeddings. OVL/REL are from a 14-participant listening test. Source: SA3 paper Table 5.
| Model | FAD ↓ | CLAP ↑ | OVL ↑ (1–5) | REL ↑ (1–5) | Inference (H200) | Sampler / steps |
|---|---|---|---|---|---|---|
| COMPETITORS | | | | | | |
| TangoFlux | 0.760 | 0.179 | 2.35 ± 1.04 | 3.25 ± 1.37 | 1.90s | Flow matching, 50 steps, CFG 4.5 |
| Woosh DFlow | 0.619 | 0.228 | 3.10 ± 1.25 | 3.20 ± 1.64 | 0.06s | Distilled flow, 4 steps |
| Woosh Flow | 0.580 | 0.277 | 3.45 ± 1.19 | 3.80 ± 1.28 | 1.92s | Adaptive ODE (~72 steps avg) |
| STABILITY AI: PRIOR GENERATION | | | | | | |
| Stable Audio Open | 0.501 | 0.263 | 2.95 ± 1.32 | 3.30 ± 1.30 | 12.30s | DPM++ 3M SDE, 100 steps, CFG 7 |
| Stable Audio Open Small | 0.500 | 0.277 | 3.10 ± 1.12 | 3.55 ± 1.00 | 0.24s | PingPong, 8 steps |
| STABLE AUDIO 3: POST-TRAINED (8 PING-PONG STEPS, NO CFG) | | | | | | |
| SA3 small-sfx | 0.395 | 0.351 | 3.35 ± 1.39 | 3.25 ± 1.45 | 0.41s | PingPong, 8 steps |
| SA3 medium | 0.369 | 0.369 | 3.65 ± 1.14 | 3.95 ± 1.23 | 0.60s | PingPong, 8 steps |
| SA3 large | 0.358 | 0.370 | 3.60 ± 0.94 | 3.85 ± 1.04 | 0.64s | PingPong, 8 steps |
Woosh DFlow achieves the fastest inference (0.06s) but at a quality cost: its FAD is higher than Woosh Flow's. SA3 small-sfx, medium, and large all outperform every competitor on FAD and CLAP at the 5s generation length.
SA3 models do not use classifier-free guidance (CFG) at inference. CFG quality gains are internalized during distillation warmup training.
Key Takeaways
- Stable Audio 3 is a family of open-weight latent diffusion models (small, medium, large) for music and sound effects generation and editing.
- A SAME autoencoder with 4096× downsampling compresses audio into 256-dimensional latents at ~10.76 Hz, making long-form generation tractable on consumer hardware.
- Variable-length generation is natively supported: inference cost scales with requested duration, not a fixed maximum length.
- Three-stage training (flow matching → distillation warmup → adversarial post-training) enables 8-step inference without classifier-free guidance.
- Prompt prefixes ("TrackType: Music, VocalType: Instrumental," / "TrackType: SFX,") are required for AudioSparx-trained model variants.
The post Stability AI Releases Stable Audio 3: A Family of Fast Latent Diffusion Models for Audio Generation and Editing appeared first on MarkTechPost.
