NVIDIA Introduces SANA-WM: A 2.6B-Parameter Open-Source World Model That Generates Minute-Scale 720p Video on a Single GPU
World models (systems that synthesize realistic video sequences from an initial image and a set of actions) are becoming central to embodied AI, simulation, and robotics research. The core challenge is scaling these systems to generate minute-long, high-resolution video without requiring prohibitively large clusters for both training and inference. Most competitive open-source baselines either require multi-GPU inference or sacrifice resolution to stay within compute budgets.
NVIDIA's SANA-WM directly targets these bottlenecks. Built on the SANA-Video codebase and available via the NVlabs/Sana GitHub repository, it is a 2.6B-parameter Diffusion Transformer (DiT) trained natively for one-minute generation at 720p with metric-scale 6-DoF camera control. It supports three single-GPU inference variants: a bidirectional generator for high-quality offline synthesis, a chunk-causal autoregressive generator for sequential rollout, and a few-step distilled autoregressive generator for faster deployment. The distilled variant denoises a 60-second 720p clip in 34 seconds on a single RTX 5090 with NVFP4 quantization.

The Architecture: Four Core Design Decisions
1. Hybrid Linear Attention with Gated DeltaNet (GDN)
Standard softmax attention has memory and compute costs that grow quadratically with sequence length, a serious problem when generating 961 latent frames for a 60-second video at 720p. SANA-Video, the predecessor, used cumulative ReLU-based linear attention, which maintains a constant-size recurrent state. However, it has no decay mechanism: all past frames accumulate with equal weight, causing drift over minute-scale sequences.
SANA-WM replaces most attention blocks with frame-wise Gated DeltaNet (GDN). Unlike the token-wise GDN used in language models, SANA-WM's frame-wise variant processes one entire latent frame per recurrent step. The GDN update rule incorporates a decay gate γ (which down-weights stale past frames) and a delta-rule correction (which updates only the residual between the target value and the current state prediction), keeping the recurrent state at a fixed D×D size regardless of video length.
To stabilize training, the research team introduces an algebraic key-scaling scheme: keys are scaled by 1/√(D·S), where D is the head dimension and S is the number of spatial tokens per frame. This keeps the spectral norm of the transition matrix bounded and eliminates the NaN divergences observed with standard L2 key normalization (1/√D) and with no scaling at all, which diverged at steps 16 and 1, respectively.
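To make the recurrence concrete, here is a minimal single-head sketch of the frame-wise update, assuming scalar gates γ and β per frame; the exact gating parameterization and the fused kernels in the actual model are more involved.

```python
import torch

def framewise_gdn_step(state, q, k, v, gamma, beta):
    """One frame-wise Gated DeltaNet step (illustrative single-head sketch).

    state : (D, D) recurrent state -- fixed size regardless of video length
    q,k,v : (S, D) queries/keys/values for all spatial tokens of one latent frame
    gamma : scalar decay gate in (0, 1) that down-weights stale past frames
    beta  : scalar gate controlling the strength of the delta correction
    """
    S, D = k.shape
    # Algebraic key scaling: 1/sqrt(D*S) keeps the spectral norm of the
    # frame transition bounded, unlike the standard 1/sqrt(D) scaling.
    k = k / (D * S) ** 0.5
    # Decay gate: exponentially forget old content.
    state = gamma * state
    # Delta rule: write only the residual between the target values and
    # what the current state already predicts for these keys.
    pred = k @ state                       # (S, D) prediction from memory
    state = state + beta * (k.t() @ (v - pred))
    # Read out the frame's tokens against the updated state.
    return q @ state, state                # output (S, D); state stays (D, D)
```

Rolling this step over all 961 latent frames keeps memory flat, in contrast to softmax attention, whose KV cache grows linearly and whose compute grows quadratically with sequence length.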
The final backbone interleaves 15 frame-wise GDN blocks with 5 softmax attention blocks (at layers 3, 7, 11, 15, and 19) across 20 total transformer blocks. The softmax blocks provide exact long-range recall where GDN's recurrence alone is insufficient.
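In configuration terms, the layout reduces to something like the following (assuming zero-based layer indices; the real model builds full transformer blocks, not strings):

```python
# 20 transformer blocks: frame-wise GDN everywhere except five softmax
# attention blocks at layers 3, 7, 11, 15, and 19 (every fourth layer).
ATTENTION_LAYOUT = ["softmax" if i % 4 == 3 else "gdn" for i in range(20)]
assert ATTENTION_LAYOUT.count("gdn") == 15
assert ATTENTION_LAYOUT.count("softmax") == 5
```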
2. Dual-Branch Camera Control
Camera-controlled world modeling requires the model to faithfully follow a continuous 6-DoF trajectory, not just align with a text description of motion. SANA-WM uses two complementary branches that operate at different temporal rates:
- Coarse branch (UCPE attention): Operates at the latent-frame rate. For each latent token, it computes a ray-local camera basis from the camera-to-world pose and intrinsics, then applies a Unified Camera Positional Encoding (UCPE) to the geometric channels of each attention head. This captures global trajectory structure across the full sequence.
- Fine branch (Plücker mixing): Addresses a compression mismatch. Each latent token summarizes eight raw frames, each with its own distinct camera pose. The fine branch computes pixel-wise Plücker raymaps (a 6D representation: ray direction d and moment o×d) from all eight raw frames within one VAE temporal stride, packs them into a 48-channel tensor, and injects this embedding after each self-attention output via a zero-initialized projection. This restores intra-stride camera motion that the coarse branch cannot see at latent-frame resolution (see the raymap sketch below).
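The raymap itself is standard projective geometry. Below is a minimal sketch of per-pixel Plücker coordinates for a single pinhole frame, under assumed tensor shapes; the model concatenates eight such 6-channel maps into the 48-channel tensor described above.

```python
import torch
import torch.nn.functional as F

def plucker_raymap(K, cam2world, H, W):
    """Per-pixel Plücker coordinates for one raw frame (illustrative sketch).

    K         : (3, 3) pinhole intrinsics
    cam2world : (4, 4) camera-to-world pose
    returns   : (6, H, W) -- ray direction d and moment o x d per pixel
    """
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=torch.float32),
        torch.arange(W, dtype=torch.float32),
        indexing="ij",
    )
    # Unproject pixel centers into camera-space ray directions.
    pix = torch.stack([xs + 0.5, ys + 0.5, torch.ones_like(xs)], dim=-1)
    dirs_cam = pix @ torch.linalg.inv(K).t()
    # Rotate into world space, normalize, and form the moment o x d.
    R, o = cam2world[:3, :3], cam2world[:3, 3]
    d = F.normalize(dirs_cam @ R.t(), dim=-1)           # (H, W, 3)
    m = torch.cross(o.expand_as(d), d, dim=-1)          # (H, W, 3)
    return torch.cat([d, m], dim=-1).permute(2, 0, 1)   # (6, H, W)

# Eight raw frames per VAE temporal stride -> one 48-channel tensor:
# raymaps = torch.cat([plucker_raymap(K, p, H, W) for p in poses], dim=0)
```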
Ablations on OmniWorld show that neither branch alone matches the dual approach: UCPE-only achieves a Camera Motion Consistency (CamMC) of 0.2453, while UCPE + Plücker mixing reaches 0.2047 (lower is better).
3. Two-Stage Generation Pipeline
Stage-1 SANA-WM outputs, while spatiotemporally consistent, can contain structural artifacts over long sequences. A second-stage refiner, initialized from the 17B LTX-2 model with rank-384 LoRA adapters fine-tuned on paired synthetic and real video data, corrects these artifacts. It uses truncated-σ flow matching: stage-1 latents are perturbed with a large starting noise (σ_start = 0.9), and the refiner learns to map this noisy input toward the high-fidelity target. Only three Euler denoising steps are needed at inference. The refiner reduces long-horizon visual drift (ΔIQ) from 3.79 to 1.17 on the Simple-Trajectory split, and from 3.09 to 0.31 on the Hard-Trajectory split.
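A minimal sketch of that truncated-σ refinement loop, assuming a rectified-flow velocity parameterization and a `refiner(x, sigma)` callable (both assumptions; all conditioning inputs are omitted):

```python
import torch

@torch.no_grad()
def refine(stage1_latents, refiner, sigma_start=0.9, num_steps=3):
    """Truncated-sigma flow-matching refinement (illustrative sketch).

    Rather than starting from pure noise (sigma = 1), perturb the stage-1
    latents at a large starting noise level and integrate the learned
    velocity field back to sigma = 0 with a few Euler steps.
    """
    noise = torch.randn_like(stage1_latents)
    x = (1.0 - sigma_start) * stage1_latents + sigma_start * noise
    sigmas = torch.linspace(sigma_start, 0.0, num_steps + 1)
    for i in range(num_steps):
        v = refiner(x, sigma=sigmas[i])            # predicted velocity
        x = x + (sigmas[i + 1] - sigmas[i]) * v    # one Euler step
    return x
```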
4. Robust Data Annotation Pipeline
Training camera-controlled video generation requires metric-scale 6-DoF pose annotations, data not available in standard video datasets. The research team modified VIPE (a camera-pose annotation engine) by replacing its depth backend with Pi3X (for long-sequence-consistent depth) fused with MoGe-2 (for accurate per-frame metric scale). They also extended the bundle adjustment stage to treat focal lengths and principal points as per-frame variables rather than shared global intrinsics, enabling more robust annotation on web video with varying focal lengths.
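For illustration, a per-frame-intrinsics reprojection residual might look like the sketch below (a hypothetical helper, not VIPE's actual code): the optimizer is free to vary `fx, fy, cx, cy` per frame instead of sharing one set across the sequence.

```python
import torch

def reprojection_residual(point_w, world2cam, fx, fy, cx, cy, observed_uv):
    """Bundle-adjustment residual with per-frame intrinsics (sketch).

    Because (fx, fy, cx, cy) are per-frame optimization variables rather
    than shared global intrinsics, zoom and focal-length changes common
    in web video can be absorbed frame by frame.
    """
    p = world2cam[:3, :3] @ point_w + world2cam[:3, 3]  # world -> camera
    u = fx * p[0] / p[2] + cx                           # pinhole projection
    v = fy * p[1] / p[2] + cy
    return torch.stack([u, v]) - observed_uv            # 2D reprojection error
```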
The resulting pipeline processes seven training corpora drawn from multiple open-source sources: SpatialVID-HQ (real, 10s clips), DL3DV real clips (10s), DL3DV GS Refined synthetic clips (60s, rendered via 3D Gaussian Splatting), OmniWorld (synthetic, 60s), Sekai Game (synthetic, 60s), Sekai Walking-HQ (real, 60s), and MiraData (real, 60s). This yields a total of 212,975 clips with metric-scale pose annotations. The LTX2-VAE used for compression is 2.0× smaller than ST-DC-AE and 8.0× smaller than Wan2.1-VAE, which directly improves training and inference efficiency.
For DL3DV, which contains static 3D scene captures rather than native one-minute videos, the research team fit one FCGS 3D Gaussian Splatting reconstruction per scene, designed varied one-minute camera paths, rendered long videos with known intrinsics and extrinsics, and then refined the rendered outputs with DiFix3D to reduce splatting artifacts.
Training Strategy and Infrastructure
SANA-WM's training involves two phases on 64 H100 GPUs. First, before DiT training, the team adapts the LTX2 VAE to the SANA-Video SFT training data over roughly 50K steps, taking about 3.5 days. The main DiT training then follows a four-stage progressive schedule lasting roughly 15 days:
- Stage 1 (~2.75 days): Adapt the pre-trained SANA-Video model to the frame-wise GDN architecture on short (5s) video clips. This replaces cumulative linear attention with the recurrent GDN blocks in a cheaper, short-horizon training regime where failure modes are easier to diagnose.
- Stage 2 (~2 days): Introduce hybrid attention by replacing every fourth GDN block with a standard softmax attention block in the same short-clip setting, improving the efficiency–quality trade-off.
- Stage 3 (~8 days): Extend training to 961-frame (60-second) sequences and incorporate Dual-Branch Camera Control. Context-Parallel (CP=2) sharding distributes the latent sequence across GPUs using prefix-sum composition of GDN transition matrices, a mathematically exact parallelization strategy with minimal communication overhead (sketched below).
- Stage 4 (~2.5 days): Fine-tune a chunk-causal variant for autoregressive rollout, then apply self-forcing distillation to reduce sampling to four denoising steps. Attention-sink tokens and local temporal windows are added to the softmax attention layers to keep memory and per-chunk latency constant during long rollouts.
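The sink-plus-window pattern from Stage 4 can be illustrated at chunk granularity with a simple boolean mask (a sketch; the actual implementation operates on tokens inside the softmax layers):

```python
import torch

def sink_plus_window_mask(num_chunks, window, num_sinks=1):
    """Causal mask with attention sinks and a local temporal window (sketch).

    Each chunk attends to a few always-visible "sink" chunks plus its
    recent neighbors, so memory and per-chunk latency stay constant no
    matter how long the autoregressive rollout runs.
    """
    idx = torch.arange(num_chunks)
    causal = idx[None, :] <= idx[:, None]             # never attend to the future
    local = (idx[:, None] - idx[None, :]) < window    # only the recent past
    sinks = (idx < num_sinks)[None, :]                # global sink chunks
    return causal & (local | sinks)
```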
Custom fused Triton kernels for the GDN scan and gate operations contribute roughly 1.5× to 2× efficiency gains throughout training.
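The prefix-sum trick mentioned in Stage 3, and implemented by those fused scan kernels, relies on the GDN recurrence being affine, S_t = A_t S_{t-1} + B_t, and on affine maps composing associatively. A minimal sketch, assuming per-frame (A, B) pairs are already materialized:

```python
import torch

def compose(seg1, seg2):
    """Compose two affine segments of the recurrence S -> A @ S + B.

    Applying seg1 then seg2 yields S -> (A2 @ A1) @ S + (A2 @ B1 + B2).
    Associativity is what makes an exact parallel prefix scan possible.
    """
    (A1, B1), (A2, B2) = seg1, seg2
    return A2 @ A1, A2 @ B1 + B2

def reduce_chunk(transitions):
    """Fold one rank's per-frame (A_t, B_t) pairs into a single summary.

    Under CP=2, each GPU reduces its own chunk locally; ranks then exchange
    only these compact D x D summaries, keeping communication minimal.
    """
    acc = transitions[0]
    for seg in transitions[1:]:
        acc = compose(acc, seg)
    return acc

# Toy usage: eight per-frame transitions collapsed to one map (D = 4).
D = 4
frames = [(0.9 * torch.eye(D), 0.01 * torch.randn(D, D)) for _ in range(8)]
A, B = reduce_chunk(frames)
```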
Benchmark Results
The research team introduces a purpose-built 60-second world-model benchmark with 80 initial scenes generated by Nano Banana Pro across four scene categories (game, indoor, outdoor-city, and outdoor-nature; 20 per category), each paired with Simple and Hard camera-trajectory splits. The main evaluation uses each model's multi-step, undistilled autoregressive setting.

On this benchmark, SANA-WM with the second-stage refiner achieves the following across both splits:
- Camera accuracy (Simple / Hard): Rotation error (RotErr) of 4.50° / 8.34°; translation error (TransErr) of 1.39 / 1.39; CamMC of 1.41 / 1.44, the best among all compared methods, including LingBot-World (14B+14B parameters, 8 GPUs) and HY-WorldPlay (8B parameters, 8 GPUs).
- Visual quality: 80.62 / 81.89 VBench Overall on the Simple / Hard splits, comparable to LingBot-World (81.82 / 81.89), while generating 720p outputs on a single GPU per clip.
- Throughput: 22.0 videos/hour on 8 H100s with the full pipeline (refiner included), compared to 0.6 videos/hour for LingBot-World, a 36× throughput advantage.
- Memory: The full pipeline fits in 74.7 GB, within the 80 GB H100 budget. Stage-1-only inference fits in 51.1 GB.
- Temporal stability: After refinement, ΔIQ (imaging-quality degradation from the first to the last 10-second window) drops to 1.17 on Simple and 0.31 on Hard, compared to 23.59 and 25.88 for HY-WorldPlay.
Key Takeaways
- NVIDIA's SANA-WM generates 60-second, 720p, camera-controlled videos on a single GPU, trained in ~18.5 days on 64 H100s using only 212,975 public video clips.
- A hybrid Gated DeltaNet + softmax attention backbone keeps the recurrent state at a fixed D×D size regardless of video length, solving the memory explosion that makes minute-scale generation impractical with standard softmax attention.
- Dual-branch camera control, with UCPE at the latent-frame rate and Plücker mixing at the raw-frame rate, brings CamMC down to 0.2047, the best among all compared methods, including models 5× larger.
- A second-stage refiner initialized from the 17B LTX-2 model with rank-384 LoRA adapters cuts long-horizon visual drift (ΔIQ) from 3.09 to 0.31 on Hard trajectories using just three Euler denoising steps.
- At 22.0 videos/hour on 8 H100s, SANA-WM + refiner delivers 36× higher throughput than LingBot-World (14B+14B, 8 GPUs) at comparable VBench visual-quality scores.
Check out the Paper, GitHub Repo and Project Page.
