Stability AI Releases Stable Audio 3: A Family of Fast Latent Diffusion Models for Audio Generation and Editing

Stability AI has released open weights for Stable Audio 3 together with a technical research paper. Stable Audio 3 is a family of latent diffusion models that generate stereo audio at 44.1 kHz. The models support variable-length outputs, inpainting-based editing, and fast inference.

What Is Stable Audio 3?

Stable Audio 3 is a family of three model scales: small, medium, and large. A latent diffusion model generates audio by learning to progressively remove noise from a compressed representation of audio, called a latent. The model learns a mapping from noise to data by training on many (noisy latent, audio) pairs.

The three model scales differ in capacity and maximum generation length. All parameter counts below are for the diffusion transformer component only. Each model also includes a SAME autoencoder (108M parameters for SAME-S, 852M for SAME-L).

  • small-music: 459M diffusion transformer parameters, up to 2 minutes, music only.
  • small-sfx: 459M diffusion transformer parameters, up to 2 minutes, sound effects only.
  • medium: 1.4B diffusion transformer parameters, up to 6 minutes and 20 seconds, music and sound effects.
  • large: 2.7B diffusion transformer parameters, up to 6 minutes and 20 seconds, music and sound effects.

Open weights for small and medium are available on Hugging Face. Large is available under an enterprise license.

Architecture: Two Components

Stable Audio 3 has two main components: a semantic-acoustic autoencoder called SAME, and a diffusion transformer that generates latent sequences conditioned on text, duration, and inpainting masks.

https://arxiv.org/pdf/2605.17991

The SAME Autoencoder

SAME (Semantically-Aligned Music autoEncoder) converts stereo 44.1 kHz audio into a compact latent representation and back. Its key design parameter is a 4096× downsampling ratio, significantly higher than the 1024× to 2048× ratios common in prior audio autoencoders. This higher ratio shortens latent sequences enough for long-form generation to run on consumer hardware.

SAME achieves its 4096× compression in two stages. First, a patching stage reshapes stereo audio into non-overlapping patches of 256 samples per channel, achieving 256× downsampling. Second, a Transformer Resampling Block (TRB) applies a further 16× downsampling using learnable output embeddings interleaved with the input sequence, processed through a transformer. The combined output is a 256-dimensional latent sequence at roughly 10.76 Hz for a 44.1 kHz input.
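The compression arithmetic above can be checked with a short sketch. The function names are illustrative, and rounding a partial final frame up is an assumption rather than something stated in the article.

```python
import math

SAMPLE_RATE_HZ = 44_100
PATCH_SAMPLES = 256   # patching stage: 256 samples per channel
TRB_FACTOR = 16       # further downsampling by the Transformer Resampling Block


def downsampling_ratio() -> int:
    """Total downsampling: patching stage times TRB stage."""
    return PATCH_SAMPLES * TRB_FACTOR


def latent_rate_hz() -> float:
    """Latent frames per second of 44.1 kHz audio."""
    return SAMPLE_RATE_HZ / downsampling_ratio()


def latent_frames(duration_s: float) -> int:
    """Latent sequence length for a clip.

    Assumption: a partial final frame is rounded up (i.e. padded).
    """
    return math.ceil(duration_s * SAMPLE_RATE_HZ / downsampling_ratio())


if __name__ == "__main__":
    print(downsampling_ratio())
    print(latent_rate_hz())
    print(latent_frames(20), latent_frames(380))
```

Under these assumptions a 6 minute 20 second clip is only a few thousand latent frames long, which is what keeps the transformer's sequence length manageable.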

The SAME autoencoder is trained with five loss types: spectral reconstruction, adversarial, diffusion alignment, semantic regression (predicting chroma and interaural level difference), and contrastive latent alignment. These losses push the latent to preserve both acoustic reconstruction quality and semantic structure. A soft-normalisation bottleneck constrains the scale of the latent, providing deterministic encoding.

The SAME autoencoder is frozen during diffusion training. Small models use SAME-S (108M parameters, optimized for CPU inference); medium and large use SAME-L (852M parameters).

The Diffusion Transformer

The diffusion transformer operates on SAME latents. Conditioning enters through three pathways:

  1. Text: a frozen T5Gemma encoder produces a sequence of 256 embeddings of dimension 768. Short prompts are padded to 256 with a learned embedding; long prompts are truncated.
  2. Duration: encoded as a Fourier features vector and injected via both Adaptive Layer Normalization (AdaLN) and cross-attention alongside the text prompt.
  3. Inpainting: a binary mask concatenated with the masked reference audio is projected through a 2-layer MLP and added to the residual stream of each transformer block.

Each transformer block contains self-attention, cross-attention, local-additive conditioning for inpainting, and a SwiGLU feed-forward network. Medium and large use differential attention, which computes two separate attention maps using two (Q, K) pairs sharing one set of values V, then subtracts one map from the other. This cancels attention patterns that are common to both maps. The transformer prepends 64 learnable memory embeddings before processing each sequence. These provide a global context buffer that every position can attend to, and are removed before computing any loss.
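The subtraction of two attention maps can be illustrated with a minimal single-head sketch in plain Python. The weighting `lam` on the subtracted map is an assumption borrowed from common differential-attention formulations; the article does not say how the two maps are weighted, and the real model works on batched multi-head tensors.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(q, k):
    """Row-wise softmax of scaled dot products between queries and keys."""
    scale = 1.0 / math.sqrt(len(q[0]))
    return [
        softmax([scale * sum(a * b for a, b in zip(qi, kj)) for kj in k])
        for qi in q
    ]


def differential_attention(q1, k1, q2, k2, v, lam=1.0):
    """Apply (map1 - lam * map2) to one shared set of values.

    lam is a hypothetical weighting, not a value taken from the article.
    """
    a1 = attention_map(q1, k1)
    a2 = attention_map(q2, k2)
    diff = [[x - lam * y for x, y in zip(r1, r2)] for r1, r2 in zip(a1, a2)]
    dim = len(v[0])
    return [
        [sum(w * vj[d] for w, vj in zip(row, v)) for d in range(dim)]
        for row in diff
    ]
```

When both (Q, K) pairs produce the same map, the difference is zero everywhere, which is the cancellation of shared attention patterns described above.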

Variable-Length Generation

Most prior latent diffusion models for audio operate at a fixed maximum sequence length. Generating a short clip still requires running inference at full length, wasting compute on silence. Stable Audio 3 is trained to generate audio at variable lengths natively, using three mechanisms:

  • Variable-length flash attention and masked loss: sequences shorter than the batch maximum are right-padded in latent space. Padding positions are excluded from self-attention and from the loss.
  • Per-element timestep shifts: longer sequences retain more structure at a given noise level due to redundancy between neighboring elements. To compensate, the noise schedule is shifted toward higher noise levels for longer sequences during training, using a logistic shift parameterized by µ (interpolating between µmin=0.5 and µmax=1.15 based on sequence length).
  • Silence augmentation: the signal region is randomly extended with pre-computed silence embeddings, with the extension length drawn from an exponential distribution averaging 4 seconds. This teaches the model to terminate audio with natural silence.
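The first two mechanisms can be sketched as follows. The exact shift formula is not given in the article, so this sketch assumes the widely used logistic form sigmoid(logit(t) + µ) with t = 1 meaning pure noise, and assumes µ is interpolated linearly in sequence length. The names `min_frames` and `max_frames` are hypothetical.

```python
import math

MU_MIN, MU_MAX = 0.5, 1.15  # values quoted in the article


def mu_for_length(n_frames, min_frames, max_frames):
    """Assumption: linear interpolation in length, clamped to [MU_MIN, MU_MAX]."""
    frac = (n_frames - min_frames) / (max_frames - min_frames)
    frac = min(1.0, max(0.0, frac))
    return MU_MIN + frac * (MU_MAX - MU_MIN)


def shift_timestep(t, mu):
    """Assumed logistic shift: sigmoid(logit(t) + mu), for t in (0, 1].

    With t = 1 as pure noise, a larger mu pushes t toward higher noise.
    """
    return math.exp(mu) / (math.exp(mu) + (1.0 / t - 1.0))


def masked_mse(pred, target, valid):
    """Mean squared error over valid frames only; padding is ignored."""
    sq = [(p - q) ** 2 for p, q, keep in zip(pred, target, valid) if keep]
    if not sq:
        raise ValueError("at least one valid frame is required")
    return sum(sq) / len(sq)
```

In this sketch a longer sequence receives a larger µ, so the same sampled timestep maps to a noisier training point, matching the compensation described in the second bullet.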

The practical result is that inference cost scales with output duration. Medium generates 20 seconds of audio in roughly 0.62 seconds on an H200. Generating 380 seconds takes 1.31 seconds on the same hardware.

Three-Stage Training Pipeline

Stage 1: Flow Matching Pre-Training. The model learns a velocity field that transports Gaussian noise toward audio latents. Training uses minibatch optimal transport coupling via Sinkhorn iterations, which pairs each data sample with the closest available noise vector in the batch. This straightens training trajectories and reduces crossing transport paths. Inpainting is trained jointly throughout. At each step, one of three mask types is sampled: full mask (80%, equivalent to unconditional generation), random segment mask (10%), or a causal prefix mask for continuation (10%).
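The mask sampling in Stage 1 can be sketched in a few lines. The 80/10/10 split comes from the text; the mask polarity (1 = frame to generate), the way segment boundaries are drawn, and the function name are assumptions made for illustration.

```python
import random


def sample_inpaint_mask(n_frames, rng):
    """Return 0/1 flags per latent frame; 1 marks frames the model must generate.

    Assumptions: n_frames >= 2, and segment and prefix boundaries are
    drawn uniformly. The article does not specify either detail.
    """
    u = rng.random()
    if u < 0.8:
        # Full mask: nothing is given, equivalent to unconditional generation.
        return [1] * n_frames
    if u < 0.9:
        # Random segment mask: one contiguous region to regenerate.
        start = rng.randrange(n_frames)
        end = rng.randrange(start + 1, n_frames + 1)
        return [1 if start <= i < end else 0 for i in range(n_frames)]
    # Causal prefix mask: keep a prefix as context, generate the continuation.
    keep = rng.randrange(1, n_frames)
    return [0] * keep + [1] * (n_frames - keep)


if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_inpaint_mask(16, rng))
```

Because the full mask dominates, most training steps look like ordinary text-to-audio generation, while the remaining steps teach editing and continuation.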

Stage 2: Distillation Warmup. A frozen copy of the flow matching model (the teacher) generates 15-step DPM++ trajectories with CFG scale 5. The student is trained for 10,000 steps to map any intermediate noisy state directly to the teacher's final denoised output in a single step, using an MSE loss. This collapses the multi-step ODE into a single-step denoiser. The trade-off is that MSE regression produces outputs that regress toward the conditional mean, reducing fine-grained detail.

Stage 3: Adversarial Post-Training. This stage replaces the MSE objective with a relativistic adversarial setup. A discriminator (initialized from the base flow matching model) evaluates the student's one-step denoised outputs directly against real data. The teacher is discarded entirely at this stage. The generator is trained with two losses: a relativistic adversarial loss (L_R) and a CLAP alignment loss (L_CLAP). The discriminator is trained with L_R and a contrastive loss (L_C) that penalizes the discriminator for ignoring text-audio alignment (it is trained to distinguish correctly paired audio-text pairs from shuffled ones). The adversarial setup allows the model to recover the perceptual sharpness that MSE distillation removes.
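The article does not give the exact form of L_R. As a rough illustration only, the sketch below uses one common relativistic pairing loss, in which the discriminator is rewarded when a real sample scores higher than its paired generated sample and the generator is rewarded for the opposite. The function names are hypothetical and the scores are plain floats standing in for discriminator outputs.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def relativistic_d_loss(d_real, d_fake):
    """Discriminator loss: small when real scores exceed paired fake scores."""
    pairs = list(zip(d_real, d_fake))
    return sum(softplus(-(r - f)) for r, f in pairs) / len(pairs)


def relativistic_g_loss(d_real, d_fake):
    """Generator loss: small when fake scores exceed paired real scores."""
    pairs = list(zip(d_real, d_fake))
    return sum(softplus(-(f - r)) for r, f in pairs) / len(pairs)
```

Only the difference between real and generated scores matters in this formulation, so the discriminator judges the student's outputs relative to real data rather than against an absolute threshold.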

Inference: Ping-Pong Sampling and No CFG

The post-trained model can generate audio in a single forward pass. However, single-step generation from pure noise remains difficult. Stable Audio 3 uses ping-pong sampling at inference: the model denoises to a clean estimate, then adds fresh noise at a reduced level, then denoises again. This repeats for 8 steps using a logSNR-uniform schedule (N+1 equally-spaced steps in the interval [λmin, λmax] = [−6.2, 2.0]). The iterative denoise-then-renoise schedule allows each step to correct errors from the previous step.
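The denoise-then-renoise loop can be sketched with a toy denoiser. Several details here are assumptions: the noising convention x_t = (1 − t)·x0 + t·noise (which gives logSNR = 2·log((1 − t)/t)), the choice to return the clean estimate after the final call, and the `denoise(x, t)` callable, which stands in for the one-step model.

```python
import math
import random


def logsnr_uniform_levels(n_steps, lam_min=-6.2, lam_max=2.0):
    """N+1 equally spaced logSNR values, ordered from noisiest to cleanest."""
    step = (lam_max - lam_min) / n_steps
    return [lam_min + i * step for i in range(n_steps + 1)]


def logsnr_to_t(lam):
    """Assumed mapping for x_t = (1 - t) * x0 + t * noise."""
    return 1.0 / (1.0 + math.exp(lam / 2.0))


def ping_pong_sample(denoise, n_frames, n_steps=8, rng=None):
    """Alternate one-step denoising with re-noising at a lower noise level."""
    rng = rng or random.Random()
    ts = [logsnr_to_t(lam) for lam in logsnr_uniform_levels(n_steps)]
    x = [rng.gauss(0.0, 1.0) for _ in range(n_frames)]  # start from noise
    for i in range(n_steps):
        x0_hat = denoise(x, ts[i])  # jump straight to a clean estimate
        if i == n_steps - 1:
            return x0_hat  # assumption: no re-noising after the last call
        t_next = ts[i + 1]
        x = [(1.0 - t_next) * c + t_next * rng.gauss(0.0, 1.0) for c in x0_hat]
    return x
```

Each iteration costs a single model call, so eight steps mean eight forward passes in total; the fresh noise injected between calls is what lets a later step revise an earlier estimate.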

Stable Audio 3 does not require classifier-free guidance (CFG) at inference. With CFG, a diffusion model runs two forward passes per step, one conditional and one unconditional, and combines the two predictions. Here, CFG quality gains are internalized during distillation warmup, where the student is trained to match CFG-enhanced teacher trajectories. Text-audio alignment is further reinforced through L_CLAP during adversarial post-training. This eliminates the two-pass-per-step cost of CFG.
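For reference, the standard CFG combination that Stable Audio 3 avoids at inference looks like the sketch below. The lists stand in for the conditional and unconditional model predictions at one sampling step, and the function name is illustrative.

```python
def cfg_combine(cond, uncond, scale):
    """Standard classifier-free guidance: move from the unconditional
    prediction toward (and past) the conditional one by `scale`."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


if __name__ == "__main__":
    cond = [1.0, 2.0]    # prediction with the text prompt
    uncond = [0.0, 1.0]  # prediction with the prompt dropped
    print(cfg_combine(cond, uncond, 5.0))
```

Producing `cond` and `uncond` requires two forward passes per step, which is the cost that a student distilled from CFG-guided teacher trajectories no longer pays.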

Prompt formatting note: All Stable Audio 3 models trained on AudioSparx (small-music, medium, large) require prompt prefixes to function correctly. Music prompts should be prepended with "TrackType: Music, VocalType: Instrumental," and sound effects prompts with "TrackType: SFX,".
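A small helper can apply these prefixes consistently. The prefix strings are taken from the note above; the helper name, the `kind` argument, and the single space placed between prefix and prompt are assumptions, not part of any official API.

```python
MUSIC_PREFIX = "TrackType: Music, VocalType: Instrumental,"
SFX_PREFIX = "TrackType: SFX,"


def format_prompt(prompt: str, kind: str) -> str:
    """Prepend the required prefix unless the prompt already starts with it.

    kind is "music" or "sfx" (hypothetical labels used only in this sketch).
    """
    prefixes = {"music": MUSIC_PREFIX, "sfx": SFX_PREFIX}
    if kind not in prefixes:
        raise ValueError(f"unknown prompt kind: {kind!r}")
    prefix = prefixes[kind]
    text = prompt.strip()
    if text.startswith(prefix):
        return text
    return f"{prefix} {text}"


if __name__ == "__main__":
    print(format_prompt("warm lo-fi piano with vinyl crackle", "music"))
    print(format_prompt("heavy wooden door slamming", "sfx"))
```

Making the helper idempotent avoids doubled prefixes when prompts pass through it more than once.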

Evaluation Results

Instrumental music (Song Describer Dataset, 120s). On FAD (lower is better) and CLAP score (higher is better), large achieves FAD 0.101 / CLAP 0.393. Medium achieves FAD 0.107 / CLAP 0.390. Stable Audio 2.5 (the internal prior-generation baseline) achieves FAD 0.106 / CLAP 0.395. In the listening test, medium and large score higher on musicality (MUS) than Stable Audio 2.5 (4.15 and 4.30 vs. 3.70 out of 5, respectively). Inference time for 120s audio on an H200: 0.45s for small, 0.78s for medium, 0.81s for large. Stable Audio 2.5 takes 0.85s for the same length.

Sound effects (BBC Sound Effects Dataset, 5s). Medium achieves FAD 0.369 / CLAP 0.369. The next-best open-weight baselines are Stable Audio Open Small (FAD 0.500 / CLAP 0.277) and Stable Audio Open (FAD 0.501 / CLAP 0.263). Woosh Flow scores FAD 0.580.

Audio editing (inpainting). The research team evaluates three inpainting settings: single region, two independent regions, and continuation. For music, medium achieves FAD-full of 0.046 on single inpainting and 0.046 on double inpainting. Large achieves 0.047 on both. For continuation, medium achieves FAD-full 0.074 and large achieves 0.071. Sound effects results follow a similar pattern; continuation shows higher FAD than inpainting in both domains, which the team attributes to the model having less surrounding audio context to anchor the generation.

Comparison

Model specs

| Model | Developer | Released | Architecture | Parameters | Max length | Sample rate | Domain | Open weights | Inpainting |
|---|---|---|---|---|---|---|---|---|---|
| STABLE AUDIO LINEAGE | | | | | | | | | |
| Stable Audio Open | Stability AI | Jul 2024 | Latent diffusion (DiT) | DiT 1057M + AE 156M + T5 109M | 47s | 44.1kHz stereo | Music + SFX | Yes | No |
| Stable Audio Open Small | Stability AI | 2024 | Latent diffusion (DiT) | Not published | 11s | 44.1kHz stereo | SFX | Yes | No |
| Stable Audio 2.5 | Stability AI | Internal | Latent diffusion (DiT) | Not published | 190s (3m 10s) | 44.1kHz stereo | Music | Not released | No |
| SA3 small-music ★ | Stability AI | May 2026 | Latent diffusion (SAME + DiT) | DT 459M + SAME-S 108M | 2m | 44.1kHz stereo | Music only | Yes | Yes |
| SA3 small-sfx ★ | Stability AI | May 2026 | Latent diffusion (SAME + DiT) | DT 459M + SAME-S 108M | 2m | 44.1kHz stereo | SFX only | Yes | Yes |
| SA3 medium ★ | Stability AI | May 2026 | Latent diffusion (SAME + DiT) | DT 1.4B + SAME-L 852M | 6m 20s | 44.1kHz stereo | Music + SFX | Yes | Yes |
| SA3 large ★ | Stability AI | May 2026 | Latent diffusion (SAME + DiT) | DT 2.7B + SAME-L 852M | 6m 20s | 44.1kHz stereo | Music + SFX | Enterprise | Yes |
| COMPETITORS | | | | | | | | | |
| TangoFlux | SUTD / NVIDIA / Lambda | Dec 2024 | Flow matching (DiT + MMDiT) | 515M | 30s | 44.1kHz | SFX | Yes (Apache 2.0) | No |
| Woosh Flow | Sony AI | Apr 2026 | Flow matching | Not published | 5s | Not disclosed | SFX | Yes (MIT) | No |
| Woosh DFlow | Sony AI | Apr 2026 | Distilled flow matching | Not published | 5s | Not disclosed | SFX | Yes (MIT) | No |
| DiffRhythm 2 | ASLP Lab (NPU) | Oct 2025 | Block flow matching (semi-autoregressive) | Not published | 210s (3m 30s) | 48kHz output | Music + vocals | Yes | No |
| ACE-Step 1.5 | ACE Studio / StepFun | Jan 2026 | Hybrid LM (0.6B–4B) + DiT (up to 4B) | LM 0.6B–4B + XL DiT 4B | 10m | Not disclosed | Music + vocals + lyrics | Yes | No |

★ SA3 rows: Parameter counts are for the diffusion transformer (DT) component only; SAME autoencoder parameters are listed separately. Total model size including SAME: small ~567M, medium ~2.25B, large ~3.55B.
Stable Audio 2.5 is an internal Stability AI model that has not been publicly released; it is included as the prior-generation internal baseline from the SA3 paper.
DiffRhythm 2 VAE processes 24kHz input audio and reconstructs at 48kHz (arXiv:2510.22950).

Music benchmarks (SDD, 120s)

Evaluation setup: Song Describer Dataset (SDD), 120s instrumental music generations, H200 GPU. FAD uses LAION-CLAP embeddings (630k-audioset-best.pt). OVL/REL/MUS are mean opinion scores (1–5) from a 14-participant listening test. Source: SA3 paper Tables 3 and 4.

| Model | FAD ↓ | CLAP ↑ | OVL ↑ (1–5) | REL ↑ (1–5) | MUS ↑ (1–5) | Inference (H200) | Sampler / steps |
|---|---|---|---|---|---|---|---|
| COMPETITORS | | | | | | | |
| DiffRhythm 2 | 0.293 | 0.158 | 3.05 ± 0.94 | 2.10 ± 1.29 | 2.60 ± 1.10 | 3.88s | |
| ACE-Step 1.5 xl-turbo | 0.193 | 0.321 | 3.35 ± 1.09 | 3.30 ± 1.13 | 3.15 ± 1.31 | 6.23s | |
| STABILITY AI, PRIOR GENERATION | | | | | | | |
| Stable Audio 2.5 (internal) | 0.106 | 0.395 | 3.90 ± 0.79 | 4.30 ± 0.66 | 3.70 ± 0.92 | 0.85s | DPM++ 3M SDE, 8 steps, CFG 6 |
| STABLE AUDIO 3, POST-TRAINED (8 PING-PONG STEPS, NO CFG) | | | | | | | |
| SA3 small-music | 0.145 | 0.393 | 3.20 ± 0.89 | 3.60 ± 0.94 | 3.15 ± 0.81 | 0.45s | PingPong, 8 steps |
| SA3 medium | 0.107 | 0.390 | 4.20 ± 0.89 | 4.25 ± 0.85 | 4.15 ± 0.93 | 0.78s | PingPong, 8 steps |
| SA3 large | 0.101 | 0.393 | 3.95 ± 0.89 | 3.80 ± 1.11 | 4.30 ± 0.73 | 0.81s | PingPong, 8 steps |

FAD: Fréchet Audio Distance (lower is better). CLAP: cosine similarity between text and audio embeddings (higher is better).
OVL = overall production quality. REL = text relevance. MUS = musicality (melody/harmony coherence).
ACE-Step 1.5 and DiffRhythm 2 were evaluated with instrumental prompts only, for fair comparison with SA3 (instrumental-only models). SA3 base flow matching models (50 steps, CFG 7, Euler sampler) are not shown here; see SA3 paper Table 11 for that comparison.

SFX benchmarks (BBC, 5s)

Evaluation setup: BBC Sound Effects Dataset, ≤5s generations matched to reference duration, H200 GPU. FAD uses LAION-CLAP embeddings. OVL/REL from a 14-participant listening test. Source: SA3 paper Table 5.

| Model | FAD ↓ | CLAP ↑ | OVL ↑ (1–5) | REL ↑ (1–5) | Inference (H200) | Sampler / steps |
|---|---|---|---|---|---|---|
| COMPETITORS | | | | | | |
| TangoFlux | 0.760 | 0.179 | 2.35 ± 1.04 | 3.25 ± 1.37 | 1.90s | Flow matching, 50 steps, CFG 4.5 |
| Woosh DFlow | 0.619 | 0.228 | 3.10 ± 1.25 | 3.20 ± 1.64 | 0.06s | Distilled flow, 4 steps |
| Woosh Flow | 0.580 | 0.277 | 3.45 ± 1.19 | 3.80 ± 1.28 | 1.92s | Adaptive ODE (~72 steps avg) |
| STABILITY AI, PRIOR GENERATION | | | | | | |
| Stable Audio Open | 0.501 | 0.263 | 2.95 ± 1.32 | 3.30 ± 1.30 | 12.30s | DPM++ 3M SDE, 100 steps, CFG 7 |
| Stable Audio Open Small | 0.500 | 0.277 | 3.10 ± 1.12 | 3.55 ± 1.00 | 0.24s | PingPong, 8 steps |
| STABLE AUDIO 3, POST-TRAINED (8 PING-PONG STEPS, NO CFG) | | | | | | |
| SA3 small-sfx | 0.395 | 0.351 | 3.35 ± 1.39 | 3.25 ± 1.45 | 0.41s | PingPong, 8 steps |
| SA3 medium | 0.369 | 0.369 | 3.65 ± 1.14 | 3.95 ± 1.23 | 0.60s | PingPong, 8 steps |
| SA3 large | 0.358 | 0.370 | 3.60 ± 0.94 | 3.85 ± 1.04 | 0.64s | PingPong, 8 steps |

Woosh DFlow achieves the fastest inference (0.06s) but at a quality cost: its FAD is higher than that of Woosh Flow. SA3 small-sfx, medium, and large all outperform every competitor on FAD and CLAP at the 5s generation length.
SA3 models do not use classifier-free guidance (CFG) at inference. CFG quality gains are internalized during distillation warmup training.

Key Takeaways

  • Stable Audio 3 is a family of open-weight latent diffusion models (small, medium, large) for music and sound effects generation and editing.
  • A SAME autoencoder with 4096× downsampling compresses audio into 256-dimensional latents at ~10.76 Hz, making long-form generation tractable on consumer hardware.
  • Variable-length generation is natively supported: inference cost scales with requested duration, not a fixed maximum length.
  • Three-stage training (flow matching → distillation warmup → adversarial post-training) enables 8-step inference without classifier-free guidance.
  • Prompt prefixes ("TrackType: Music, VocalType: Instrumental," / "TrackType: SFX,") are required for AudioSparx-trained model variants.

