Stability AI Releases Stable Audio 3: A Family of Fast Latent Diffusion Models for Audio Generation and Editing

Stability AI has released open weights for Stable Audio 3 together with a technical research paper. Stable Audio 3 is a family of latent diffusion models that generate stereo audio at 44.1 kHz. The models support variable-length outputs, inpainting-based editing, and fast inference.

What Is Stable Audio 3?

Stable Audio 3 is a family of three model scales: small, medium, and large. A latent diffusion model generates audio by learning to progressively remove noise from a compressed representation of audio, called a latent. The model learns a mapping from noise to data by training on many (noisy latent, audio) pairs.

The three model scales differ in capacity and maximum generation length. All parameter counts below are for the diffusion transformer component only. Each model also includes a SAME autoencoder (108M parameters for SAME-S, 852M for SAME-L).

small-music: 459M diffusion transformer parameters, up to 2 minutes, music only.
small-sfx: 459M diffusion transformer parameters, up to 2 minutes, sound effects only.
medium: 1.4B diffusion transformer parameters, up to 6 minutes and 20 seconds, music and sound effects.
large: 2.7B diffusion transformer parameters, up to 6 minutes and 20 seconds, music and sound effects.

Open weights for small and medium are available on Hugging Face. Large is available under an enterprise license.

Architecture: Two Components

Stable Audio 3 has two main components: a semantic-acoustic autoencoder called SAME, and a diffusion transformer that generates latent sequences conditioned on text, duration, and inpainting masks.

The SAME Autoencoder

SAME (Semantically-Aligned Music autoEncoder) converts stereo 44.1 kHz audio into a compact latent representation and back. Its key design parameter is a 4096× downsampling ratio, considerably higher than the 1024× to 2048× ratios common in prior audio autoencoders. This higher ratio shortens latent sequences enough for long-form generation to run on consumer hardware.

SAME achieves its 4096× compression in two stages. First, a patching stage reshapes stereo audio into non-overlapping patches of 256 samples per channel, achieving 256× downsampling. Second, a Transformer Resampling Block (TRB) applies a further 16× downsampling using learnable output embeddings interleaved with the input sequence, processed through a transformer. The combined output is a 256-dimensional latent sequence at roughly 10.76 Hz for a 44.1 kHz input.
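The arithmetic behind these figures can be checked with a few lines of Python. This is only a sketch: the function names are hypothetical, and rounding the frame count up to cover the whole clip is an assumption rather than something stated in the article.

```python
SAMPLE_RATE_HZ = 44_100
PATCH_SIZE = 256   # samples per channel per patch (patching stage)
TRB_FACTOR = 16    # further downsampling in the Transformer Resampling Block


def downsampling_ratio() -> int:
    """Total temporal compression: patching stage times TRB stage."""
    return PATCH_SIZE * TRB_FACTOR


def latent_rate_hz(sample_rate_hz: int = SAMPLE_RATE_HZ) -> float:
    """Latent frames produced per second of audio."""
    return sample_rate_hz / downsampling_ratio()


def latent_frames(duration_s: float, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Latent frames needed to cover duration_s (rounding up is an assumption)."""
    samples = round(duration_s * sample_rate_hz)
    return -(-samples // downsampling_ratio())  # ceiling division


if __name__ == "__main__":
    print("ratio:", downsampling_ratio())
    print("latent rate (Hz):", latent_rate_hz())
    for seconds in (20, 120, 380):
        print(seconds, "s ->", latent_frames(seconds), "latent frames")
```

Under these assumptions, even the longest supported clip (380 seconds) is only about 4,092 latent frames, which is what keeps long-form generation tractable.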

The SAME autoencoder is trained with five loss types: spectral reconstruction, adversarial, diffusion alignment, semantic regression (predicting chroma and interaural level difference), and contrastive latent alignment. These losses push the latent to preserve both acoustic reconstruction quality and semantic structure. A soft-normalisation bottleneck constrains the scale of the latent, providing deterministic encoding.

The SAME autoencoder is frozen during diffusion training. Small models use SAME-S (108M parameters, optimized for CPU inference); medium and large use SAME-L (852M parameters).

The Diffusion Transformer

The diffusion transformer operates on SAME latents. Conditioning enters through three pathways:

Text: a frozen T5Gemma encoder produces a sequence of 256 embeddings of dimension 768. Short prompts are padded to 256 with a learned embedding; long prompts are truncated.
Duration: encoded as a Fourier features vector and injected via both Adaptive Layer Normalization (AdaLN) and cross-attention alongside the text prompt.
Inpainting: a binary mask concatenated with the masked reference audio is projected through a 2-layer MLP and added to the residual stream of each transformer block.
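The fixed-length text pathway can be illustrated with a small NumPy sketch. The function name and the way the learned padding vector is passed in are hypothetical; only the sequence length (256) and embedding dimension (768) come from the description above.

```python
import numpy as np

MAX_TOKENS = 256  # sequence length quoted above
EMBED_DIM = 768   # embedding dimension quoted above


def pad_or_truncate(embeddings: np.ndarray, pad_embedding: np.ndarray,
                    max_tokens: int = MAX_TOKENS) -> np.ndarray:
    """Return exactly max_tokens rows: truncate long prompts, pad short ones.

    embeddings:    (n_tokens, dim) text-encoder outputs.
    pad_embedding: (dim,) vector standing in for the learned padding embedding.
    """
    n_tokens = embeddings.shape[0]
    if n_tokens >= max_tokens:
        return embeddings[:max_tokens]
    padding = np.tile(pad_embedding, (max_tokens - n_tokens, 1))
    return np.concatenate([embeddings, padding], axis=0)
```

In a real model the padding vector would be a trained parameter; here it is just an array supplied by the caller.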

Each transformer block contains self-attention, cross-attention, local-additive conditioning for inpainting, and a SwiGLU feed-forward network. Medium and large use differential attention, which computes two separate attention maps using two (Q, K) pairs sharing one set of values V, then subtracts one map from the other. This cancels attention patterns that are common to both maps. The transformer prepends 64 learnable memory embeddings before processing each sequence. These provide a global context buffer that every position can attend to, and are removed before computing any loss.
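A minimal single-head sketch of differential attention in NumPy may make the subtraction concrete. It is not the Stable Audio 3 implementation: the weighting factor `lam` on the second map is borrowed from the original Differential Transformer formulation and is an assumption here, and multi-head handling, normalization, and the memory embeddings are all omitted.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def differential_attention(q1, k1, q2, k2, v, lam: float = 1.0) -> np.ndarray:
    """Two attention maps share the values v; the second map is subtracted.

    q1, q2: (n_queries, d)   k1, k2: (n_keys, d)   v: (n_keys, d_v)
    lam is a hypothetical scalar weight on the second map (assumption).
    """
    d = q1.shape[-1]
    map1 = softmax(q1 @ k1.T / np.sqrt(d))
    map2 = softmax(q2 @ k2.T / np.sqrt(d))
    return (map1 - lam * map2) @ v
```

When both (Q, K) pairs produce the same map and `lam` is 1, the output is exactly zero, which shows how patterns shared by the two maps are cancelled.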

Variable-Length Generation

Most prior latent diffusion models for audio operate at a fixed maximum sequence length. Generating a short clip still requires running inference at full length, wasting compute on silence. Stable Audio 3 is trained to generate audio at variable lengths natively, using three mechanisms:

Variable-length flash attention and masked loss: sequences shorter than the batch maximum are right-padded in latent space. Padding positions are excluded from self-attention and from the loss.
Per-element timestep shifts: longer sequences retain more structure at a given noise level due to redundancy between neighboring elements. To compensate, the noise schedule is shifted toward higher noise levels for longer sequences during training, using a logistic shift parameterized by µ (interpolating between µmin=0.5 and µmax=1.15 based on sequence length).
Silence augmentation: the signal region is randomly extended with pre-computed silence embeddings, with the extension length drawn from an exponential distribution averaging 4 seconds. This teaches the model to terminate audio with natural silence.
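Two of these mechanisms are easy to sketch in NumPy. Everything beyond the quoted µ range is an assumption: the linear interpolation of µ with sequence length, the convention that t = 1 means pure noise, the exact form of the logistic shift (adding µ in logit space), and all function names.

```python
import math
import numpy as np

MU_MIN, MU_MAX = 0.5, 1.15  # range quoted above


def mu_for_length(n_frames: int, max_frames: int) -> float:
    """Interpolate mu with sequence length (linear interpolation is an assumption)."""
    frac = min(max(n_frames / max_frames, 0.0), 1.0)
    return MU_MIN + frac * (MU_MAX - MU_MIN)


def shift_timestep(t: float, mu: float) -> float:
    """Logistic shift: add mu in logit space, pushing t in (0, 1) toward 1.

    Assumes t = 1 is pure noise, so a larger mu means more noise on average.
    """
    logit = math.log(t / (1.0 - t))
    return 1.0 / (1.0 + math.exp(-(logit + mu)))


def masked_mse(pred: np.ndarray, target: np.ndarray, lengths) -> float:
    """Mean squared error over valid (non-padded) frames only.

    pred, target: (batch, frames, channels); lengths: valid frames per item.
    """
    frames = pred.shape[1]
    mask = np.arange(frames)[None, :] < np.asarray(lengths)[:, None]
    per_frame = ((pred - target) ** 2).mean(axis=-1)  # (batch, frames)
    return float((per_frame * mask).sum() / mask.sum())
```

With this form, a mid-schedule timestep of 0.5 moves further toward 1 for a long sequence than for a short one, and errors in right-padded positions never reach the loss.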

The practical result is that inference cost scales with output duration. Medium generates 20 seconds of audio in roughly 0.62 seconds on an H200. Generating 380 seconds takes 1.31 seconds on the same hardware.

Three-Stage Training Pipeline

Stage 1: Flow Matching Pre-Training. The model learns a velocity field that transports Gaussian noise toward audio latents. Training uses minibatch optimal transport coupling via Sinkhorn iterations, which pairs each data sample with the nearest available noise vector in the batch. This straightens training trajectories and reduces crossing transport paths. Inpainting is trained jointly throughout. At each step, one of three mask types is sampled: full mask (80%, equivalent to unconditional generation), random segment mask (10%), or a causal prefix mask for continuation (10%).
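The mask sampling described for Stage 1 could look roughly like the sketch below. The 80/10/10 split comes from the text; the mask convention (1 marks frames the model must generate, 0 marks known reference audio), the way segment and prefix boundaries are drawn, and the function name are assumptions.

```python
import numpy as np

MASK_KINDS = ("full", "segment", "prefix")
MASK_PROBS = (0.8, 0.1, 0.1)  # proportions quoted above


def sample_inpaint_mask(n_frames: int, rng: np.random.Generator):
    """Sample one training mask over n_frames latent frames (n_frames >= 2).

    Convention (assumption): 1 = frame to generate, 0 = known reference frame.
    """
    kind = str(rng.choice(MASK_KINDS, p=MASK_PROBS))
    mask = np.ones(n_frames, dtype=np.int8)  # "full": generate everything
    if kind == "segment":
        # Regenerate one random contiguous segment, keep the rest as reference.
        start = int(rng.integers(0, n_frames))
        end = int(rng.integers(start + 1, n_frames + 1))
        mask[:] = 0
        mask[start:end] = 1
    elif kind == "prefix":
        # Continuation: a known prefix, everything after it is generated.
        keep = int(rng.integers(1, n_frames))
        mask[:keep] = 0
    return kind, mask
```

Because the full mask leaves no reference audio, those samples train plain text-conditioned generation, while the other two kinds train editing and continuation.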

Stage 2: Distillation Warmup. A frozen copy of the flow matching model (teacher) generates 15-step DPM++ trajectories with CFG scale 5. The student is trained for 10,000 steps to map any intermediate noisy state directly to the teacher's final denoised output in a single step, using an MSE loss. This collapses the multi-step ODE into a single-step denoiser. The trade-off is that MSE regression produces outputs that regress toward the conditional mean, reducing fine-grained detail.

Stage 3: Adversarial Post-Training. This stage replaces the MSE objective with a relativistic adversarial setup. A discriminator (initialized from the base flow matching model) evaluates the student's one-step denoised outputs directly against real data. The teacher is discarded entirely at this stage. The generator is trained with two losses: a relativistic adversarial loss (L_R) and a CLAP alignment loss (L_CLAP). The discriminator is trained with L_R and a contrastive loss (L_C) that penalizes the discriminator for ignoring text-audio alignment (it is trained to distinguish correctly paired audio-text pairs from shuffled ones). The adversarial setup allows the model to recover the perceptual sharpness that MSE distillation removes.
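To give a feel for what "relativistic" means here, the sketch below uses a common paired formulation with a softplus (logistic) loss: the discriminator is rewarded when it scores real audio above the paired generated audio, and the generator is rewarded for the opposite. The article does not give the exact form of L_R, so this particular loss, the sign convention, and the function names are assumptions. L_CLAP and L_C are not shown.

```python
import numpy as np


def softplus(x: np.ndarray) -> np.ndarray:
    """Numerically stable log(1 + exp(x))."""
    return np.logaddexp(0.0, x)


def discriminator_relativistic_loss(d_real: np.ndarray, d_fake: np.ndarray) -> float:
    """Small when real samples score higher than their paired fakes."""
    return float(np.mean(softplus(d_fake - d_real)))


def generator_relativistic_loss(d_real: np.ndarray, d_fake: np.ndarray) -> float:
    """Small when generated samples score higher than their paired reals."""
    return float(np.mean(softplus(d_real - d_fake)))
```

Only the difference between paired scores matters, so neither network can lower its loss by shifting all scores by a constant.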

Inference: Ping-Pong Sampling and No CFG

The post-trained model can generate audio in a single forward pass. However, single-step generation from pure noise remains difficult. Stable Audio 3 uses ping-pong sampling at inference: the model denoises to a clean estimate, then adds new noise at a reduced level, then denoises again. This repeats for 8 steps using a logSNR-uniform schedule (N+1 equally-spaced steps in the interval [λmin, λmax] = [−6.2, 2.0]). The iterative denoise-then-renoise schedule allows each step to correct errors from the previous step.
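The loop structure of ping-pong sampling can be sketched as follows. Several details are assumptions rather than facts from the article: the interpolation x_t = (1 - t) * x0 + t * noise, the resulting conversion from logSNR λ to a noise level t = 1 / (1 + exp(λ / 2)), how the final schedule point is used, and the `denoise` callable, which stands in for the one-step model.

```python
import numpy as np

LAMBDA_MIN, LAMBDA_MAX = -6.2, 2.0  # logSNR range quoted above


def logsnr_uniform_levels(n_steps: int = 8) -> np.ndarray:
    """N+1 noise levels, uniformly spaced in logSNR.

    Assumes x_t = (1 - t) * x0 + t * noise, so logSNR = 2 * log((1 - t) / t).
    """
    lambdas = np.linspace(LAMBDA_MIN, LAMBDA_MAX, n_steps + 1)
    return 1.0 / (1.0 + np.exp(lambdas / 2.0))


def ping_pong_sample(denoise, shape, n_steps: int = 8, rng=None) -> np.ndarray:
    """Alternate between a clean estimate and re-noising at a lower level.

    denoise(x, t) is a hypothetical one-step denoiser returning a clean estimate.
    """
    if rng is None:
        rng = np.random.default_rng()
    levels = logsnr_uniform_levels(n_steps)
    x = rng.standard_normal(shape)  # start from noise
    for i in range(n_steps):
        x0_hat = denoise(x, levels[i])  # "ping": jump to a clean estimate
        if i < n_steps - 1:
            t_next = levels[i + 1]      # "pong": re-noise at a reduced level
            x = (1.0 - t_next) * x0_hat + t_next * rng.standard_normal(shape)
    return x0_hat
```

Each pass sees a less noisy input built from the previous clean estimate, which is what lets later steps repair earlier mistakes.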

Stable Audio 3 does not require classifier-free guidance (CFG) at inference. Standard diffusion models run two forward passes per step, one conditional and one unconditional, and combine the two predictions. Here, CFG quality gains are internalized during distillation warmup, where the student is trained to match CFG-enhanced teacher trajectories. Text-audio alignment is further reinforced through L_CLAP during adversarial post-training. This eliminates the two-pass-per-step cost of CFG.
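For reference, the standard CFG combination that Stable Audio 3 avoids at inference looks like this. The `model` callable and its argument names are hypothetical; the combination formula itself is the usual one.

```python
import numpy as np


def cfg_combine(cond: np.ndarray, uncond: np.ndarray, scale: float) -> np.ndarray:
    """Push the prediction away from the unconditional output.

    scale = 1 returns the conditional prediction, scale = 0 the unconditional one;
    larger scales extrapolate beyond the conditional prediction.
    """
    return uncond + scale * (cond - uncond)


def cfg_step(model, x, t, text_emb, null_emb, scale: float) -> np.ndarray:
    """One guided prediction: two forward passes through a hypothetical model."""
    cond = model(x, t, text_emb)    # pass 1: conditioned on the prompt
    uncond = model(x, t, null_emb)  # pass 2: conditioned on an empty prompt
    return cfg_combine(cond, uncond, scale)
```

The second forward pass is the cost being removed: a model that has absorbed the guided behaviour during distillation needs only the first one.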

Prompt formatting note: All Stable Audio 3 models trained on AudioSparx (small-music, medium, large) require prompt prefixes to function correctly. Music prompts should be prepended with "TrackType: Music, VocalType: Instrumental," and sound effects prompts with "TrackType: SFX,".
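A small helper can make the prefixes hard to forget. The two prefix strings are the ones quoted above; the function name, the `kind` argument, and the single space placed between prefix and description are assumptions, so check the model card for the exact expected spacing.

```python
MUSIC_PREFIX = "TrackType: Music, VocalType: Instrumental,"
SFX_PREFIX = "TrackType: SFX,"


def format_prompt(description: str, kind: str) -> str:
    """Prepend the required prefix for a music or sound effects prompt.

    kind: "music" or "sfx" (hypothetical labels for this sketch).
    """
    prefixes = {"music": MUSIC_PREFIX, "sfx": SFX_PREFIX}
    if kind not in prefixes:
        raise ValueError(f"kind must be one of {sorted(prefixes)}, got {kind!r}")
    return f"{prefixes[kind]} {description.strip()}"


if __name__ == "__main__":
    print(format_prompt("warm lo-fi piano with vinyl crackle", "music"))
    print(format_prompt("heavy wooden door slamming shut", "sfx"))
```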

Evaluation Results

Instrumental music (Song Describer Dataset, 120s). On FAD (lower is better) and CLAP score (higher is better), large achieves FAD 0.101 / CLAP 0.393. Medium achieves FAD 0.107 / CLAP 0.390. Stable Audio 2.5 (the internal prior-generation baseline) achieves FAD 0.106 / CLAP 0.395. In the listening test, medium and large score higher on musicality (MUS) than Stable Audio 2.5 (4.15 and 4.30 vs. 3.70 out of 5, respectively). Inference time for 120s audio on an H200: 0.45s for small, 0.78s for medium, 0.81s for large. Stable Audio 2.5 takes 0.85s for the same length.

Sound effects (BBC Sound Effects Dataset, 5s). Medium achieves FAD 0.369 / CLAP 0.369. The next-best open-weight baselines are Stable Audio Open Small (FAD 0.500 / CLAP 0.277) and Stable Audio Open (FAD 0.501 / CLAP 0.263). Woosh Flow scores FAD 0.580.

Audio editing (inpainting). The research team evaluates three inpainting settings: single region, two independent regions, and continuation. For music, medium achieves FAD-full of 0.046 on single inpainting and 0.046 on double inpainting. Large achieves 0.047 on both. For continuation, medium achieves FAD-full 0.074 and large achieves 0.071. Sound effects results follow a similar pattern; continuation shows higher FAD than inpainting in both domains, which the team attributes to the model having less surrounding audio context to anchor the generation.

Comparison

Model specifications

Model	Developer	Released	Architecture	Parameters	Max length	Sample rate	Domain	Open weights	Inpainting
STABLE AUDIO LINEAGE
Stable Audio Open	Stability AI	Jul 2024	Latent diffusion (DiT)	DiT 1057M + AE 156M + T5 109M	47s	44.1kHz stereo	Music + SFX	Yes	No
Stable Audio Open Small	Stability AI	2024	Latent diffusion (DiT)	Not published	11s	44.1kHz stereo	SFX	Yes	No
Stable Audio 2.5	Stability AI	Internal	Latent diffusion (DiT)	Not published	190s (3m 10s)	44.1kHz stereo	Music	Not released	No
SA3 small-music ★	Stability AI	May 2026	Latent diffusion (SAME + DiT)	DT 459M + SAME-S 108M	2m	44.1kHz stereo	Music only	Yes	Yes
SA3 small-sfx ★	Stability AI	May 2026	Latent diffusion (SAME + DiT)	DT 459M + SAME-S 108M	2m	44.1kHz stereo	SFX only	Yes	Yes
SA3 medium ★	Stability AI	May 2026	Latent diffusion (SAME + DiT)	DT 1.4B + SAME-L 852M	6m 20s	44.1kHz stereo	Music + SFX	Yes	Yes
SA3 large ★	Stability AI	May 2026	Latent diffusion (SAME + DiT)	DT 2.7B + SAME-L 852M	6m 20s	44.1kHz stereo	Music + SFX	Enterprise	Yes
COMPETITORS
TangoFlux	SUTD / NVIDIA / Lambda	Dec 2024	Flow matching (DiT + MMDiT)	515M	30s	44.1kHz	SFX	Yes (Apache 2.0)	No
Woosh Flow	Sony AI	Apr 2026	Flow matching	Not published	5s	Not disclosed	SFX	Yes (MIT)	No
Woosh DFlow	Sony AI	Apr 2026	Distilled flow matching	Not published	5s	Not disclosed	SFX	Yes (MIT)	No
DiffRhythm 2	ASLP Lab (NPU)	Oct 2025	Block flow matching (semi-autoregressive)	Not published	210s (3m 30s)	48kHz output	Music + vocals	Yes	No
ACE-Step 1.5	ACE Studio / StepFun	Jan 2026	Hybrid LM (0.6B–4B) + DiT (up to 4B)	LM 0.6B–4B + XL DiT 4B	10m	Not disclosed	Music + vocals + lyrics	Yes	No

★ SA3 rows: Parameter counts are given for the diffusion transformer (DT) and the SAME autoencoder separately. Total model size including SAME: small ~567M, medium ~2.25B, large ~3.55B.
Stable Audio 2.5 is an internal Stability AI model not publicly released; included as the prior-generation internal baseline from the SA3 paper.
DiffRhythm 2 VAE processes 24kHz input audio and reconstructs at 48kHz (arXiv:2510.22950).

Music benchmarks (SDD, 120s)

Evaluation setup: Song Describer Dataset (SDD), 120s instrumental music generations, H200 GPU. FAD uses LAION-CLAP embeddings (630k-audioset-best.pt). OVL/REL/MUS are mean opinion scores (1–5) from a 14-participant listening test. Source: SA3 paper Tables 3 and 4.

Model	FAD ↓	CLAP ↑	OVL ↑ (1–5)	REL ↑ (1–5)	MUS ↑ (1–5)	Inference (H200)	Sampler / steps
COMPETITORS
DiffRhythm 2	0.293	0.158	3.05 ± 0.94	2.10 ± 1.29	2.60 ± 1.10	3.88s	—
ACE-Step 1.5 xl-turbo	0.193	0.321	3.35 ± 1.09	3.30 ± 1.13	3.15 ± 1.31	6.23s	—
STABILITY AI — PRIOR GENERATION
Stable Audio 2.5 (internal)	0.106	0.395	3.90 ± 0.79	4.30 ± 0.66	3.70 ± 0.92	0.85s	DPM++ 3M SDE, 8 steps, CFG 6
STABLE AUDIO 3 — POST-TRAINED (8 PING-PONG STEPS, NO CFG)
SA3 small-music	0.145	0.393	3.20 ± 0.89	3.60 ± 0.94	3.15 ± 0.81	0.45s	PingPong, 8 steps
SA3 medium	0.107	0.390	4.20 ± 0.89	4.25 ± 0.85	4.15 ± 0.93	0.78s	PingPong, 8 steps
SA3 large	0.101	0.393	3.95 ± 0.89	3.80 ± 1.11	4.30 ± 0.73	0.81s	PingPong, 8 steps

FAD: Fréchet Audio Distance (lower is better). CLAP: cosine similarity between text and audio embeddings (higher is better).
OVL = overall production quality. REL = text relevance. MUS = musicality (melody/harmony coherence).
ACE-Step 1.5 and DiffRhythm 2 were evaluated with instrumental prompts only for fair comparison with SA3 (instrumental-only models). SA3 base flow matching models (50 steps, CFG 7, Euler sampler) are not shown here; see SA3 paper Table 11 for that comparison.
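Both metrics reduce to short formulas once embeddings are available. The sketch below computes a Fréchet distance between two sets of embeddings, treating each set as a Gaussian, and a cosine similarity between one text embedding and one audio embedding. It is not the evaluation code used in the paper: obtaining the LAION-CLAP embeddings is out of scope, and the matrix square-root trace is computed from eigenvalues as a NumPy-only shortcut.

```python
import numpy as np


def frechet_distance(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two embedding sets.

    emb_a, emb_b: (n_items, dim). Lower means the two distributions are closer.
    """
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)
    # Trace of sqrt(cov_a @ cov_b) equals the sum of square roots of its eigenvalues.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)


def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between a text embedding and an audio embedding."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```

A generated set whose embeddings match the reference set in both mean and covariance scores a Fréchet distance near zero, which is why small differences such as 0.101 versus 0.107 indicate very similar distributions.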

SFX benchmarks (BBC, 5s)

Evaluation setup: BBC Sound Effects Dataset, ≤5s generations matched to reference duration, H200 GPU. FAD uses LAION-CLAP embeddings. OVL/REL from a 14-participant listening test. Source: SA3 paper Table 5.

Model	FAD ↓	CLAP ↑	OVL ↑ (1–5)	REL ↑ (1–5)	Inference (H200)	Sampler / steps
COMPETITORS
TangoFlux	0.760	0.179	2.35 ± 1.04	3.25 ± 1.37	1.90s	Flow matching, 50 steps, CFG 4.5
Woosh DFlow	0.619	0.228	3.10 ± 1.25	3.20 ± 1.64	0.06s	Distilled flow, 4 steps
Woosh Flow	0.580	0.277	3.45 ± 1.19	3.80 ± 1.28	1.92s	Adaptive ODE (~72 steps avg)
STABILITY AI — PRIOR GENERATION
Stable Audio Open	0.501	0.263	2.95 ± 1.32	3.30 ± 1.30	12.30s	DPM++ 3M SDE, 100 steps, CFG 7
Stable Audio Open Small	0.500	0.277	3.10 ± 1.12	3.55 ± 1.00	0.24s	PingPong, 8 steps
STABLE AUDIO 3 — POST-TRAINED (8 PING-PONG STEPS, NO CFG)
SA3 small-sfx	0.395	0.351	3.35 ± 1.39	3.25 ± 1.45	0.41s	PingPong, 8 steps
SA3 medium	0.369	0.369	3.65 ± 1.14	3.95 ± 1.23	0.60s	PingPong, 8 steps
SA3 large	0.358	0.370	3.60 ± 0.94	3.85 ± 1.04	0.64s	PingPong, 8 steps

Woosh DFlow achieves the fastest inference (0.06s) but at a quality cost: its FAD is higher than that of Woosh Flow. SA3 small-sfx, medium, and large all outperform every competitor on FAD and CLAP at the 5s generation length.
SA3 models do not use classifier-free guidance (CFG) at inference. CFG quality gains are internalized during distillation warmup training.

Key Takeaways

Stable Audio 3 is a family of latent diffusion models (small, medium, large) for music and sound effects generation and editing, with open weights for small and medium.
A SAME autoencoder with 4096× downsampling compresses audio into 256-dimensional latents at ~10.76 Hz, making long-form generation tractable on consumer hardware.
Variable-length generation is natively supported: inference cost scales with requested duration, not a fixed maximum length.
Three-stage training (flow matching → distillation warmup → adversarial post-training) enables 8-step inference without classifier-free guidance.
Prompt prefixes ("TrackType: Music, VocalType: Instrumental," / "TrackType: SFX,") are required for AudioSparx-trained model variants.
