
Microsoft Research’s World-R1 Uses Flow-GRPO and 3D-Aware Rewards to Inject Geometric Consistency Into Wan 2.1 Without Architectural Changes

Video foundation models can paint a beautiful frame. They are still notoriously bad at remembering it. Push the camera through a hallway in Wan 2.1 or CogVideoX and walls warp, objects morph, and details vanish — the giveaway that these models are fitting 2D pixel correlations rather than simulating a coherent 3D scene.

A team of researchers from Microsoft Research and Zhejiang University introduced World-R1: a framework that aligns video generation with 3D constraints through reinforcement learning. The research team leans on a recent finding that video foundation models already encode rich 3D geometric information internally. The job, then, is to elicit that latent knowledge rather than supervise it with expensive 3D assets. World-R1 does this by post-training an existing text-to-video (T2V) model with reinforcement learning, using rewards derived from pre-trained 3D foundation models and a vision-language critic. The base architecture is left untouched and inference cost is unchanged.

Two World-R1 variants are released: World-R1-Small (built on Wan2.1-T2V-1.3B) and World-R1-Large (built on Wan2.1-T2V-14B).

https://arxiv.org/pdf/2604.24764

The setup: Flow-GRPO on a flow-matching video mannequin

World-R1 uses Flow-GRPO-Fast, a recent adaptation of GRPO to flow-matching diffusion models. Flow-GRPO converts the deterministic ODE sampler into a reverse-time SDE so the policy is stochastic enough for advantage estimation, then optimizes a clipped GRPO surrogate with KL regularization to a reference policy. The Fast variant only injects SDE noise at randomly chosen intermediate steps to reduce rollout cost.
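The GRPO update itself is standard once rewards are in hand. Here is a minimal NumPy sketch of the group-relative advantages and the clipped-surrogate-plus-KL objective; function names and hyperparameter values are illustrative, not from the paper, and the real training code operates on per-step diffusion log-probabilities in PyTorch:

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: z-score rewards within one rollout group."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

def grpo_objective(logp_new, logp_ref, advantages, clip_eps=0.2, kl_coef=0.01):
    """Clipped GRPO surrogate with KL regularization to the reference policy.

    logp_new / logp_ref: per-rollout log-probabilities under the current
    and reference (frozen) policies. Returns a scalar to maximize.
    """
    ratio = np.exp(logp_new - logp_ref)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    surrogate = np.minimum(unclipped, clipped).mean()
    kl = (logp_new - logp_ref).mean()  # crude sample-based KL estimate
    return surrogate - kl_coef * kl
```

In World-R1 this runs over groups of G=8 rollouts per prompt; the SDE conversion is what makes the per-step log-probabilities (and hence the ratio) well-defined for a flow-matching sampler.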

Training runs at 832×480 resolution on 48 NVIDIA H200 GPUs for the Small model and 96 H200s for the Large model, with a GRPO group size of G=8 across 48 parallel groups.

The 3D-aware reward: analysis-by-synthesis

The interesting work happens in the reward. For each generated video x, the system reconstructs a 3D Gaussian Splatting (3DGS) representation ΦGS using Depth Anything 3 and recovers an estimated camera trajectory Ê. The composite 3D reward is:

R3D = Smeta + Srecon + Straj

  • Smeta renders ΦGS from a meta-view — a camera pose offset from the generation trajectory — and asks Qwen3-VL to score the reconstruction from 0–9 as a “3D vision expert,” penalizing floaters, billboard artifacts, and texture stretching that look fine head-on but collapse off-axis.
  • Srecon re-renders the scene along Ê and compares against x via 1 − LPIPS.
  • Straj measures deviation between the requested trajectory E and the recovered Ê using L2 for translation and geodesic distance for rotation, wrapped in a negative exponential.

A general aesthetic term Rgen, computed as the mean HPSv3 score across the first K frames, is added with λgen = 1 to keep visual quality from collapsing under geometric pressure.
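A minimal sketch of how the composite reward could be assembled, assuming per-frame camera poses as (rotation matrix, translation) pairs and pre-computed scores from the VLM judge, LPIPS, and HPSv3 models. The function names, error averaging, and the decision not to rescale the 0–9 VLM score are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def geodesic_rot_dist(R1, R2):
    """Geodesic distance between two rotation matrices: the angle of R1^T R2."""
    cos = (np.trace(R1.T @ R2) - 1.0) / 2.0
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def traj_reward(E_req, E_est):
    """S_traj: negative exponential of mean translation-L2 + rotation-geodesic error.

    E_req / E_est: lists of (R, t) pairs for the requested and recovered trajectories.
    """
    err = 0.0
    for (R_req, t_req), (R_est, t_est) in zip(E_req, E_est):
        err += np.linalg.norm(t_req - t_est) + geodesic_rot_dist(R_req, R_est)
    return float(np.exp(-err / len(E_req)))

def composite_reward(s_meta, lpips, s_traj, r_gen, lam_gen=1.0):
    """R = R3D + λgen · Rgen, with R3D = S_meta + (1 − LPIPS) + S_traj."""
    r3d = s_meta + (1.0 - lpips) + s_traj
    return r3d + lam_gen * r_gen
```

A perfectly recovered trajectory gives S_traj = exp(0) = 1, and the reward degrades smoothly as the reconstruction drifts from the request.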

Implicit camera conditioning via noise warping

Rather than training a CameraCtrl-style adapter, World-R1 follows the Go-with-the-Flow paradigm: the prompt is parsed for motion tokens (push_in, orbit_left, pull_out, etc.), a sequence of camera extrinsics is generated, projected into 2D optical flow under a fronto-parallel scene assumption, and used to perform discrete noise transport on the initial latent. The transported noise preserves unit variance via a density-tracker normalization, so the diffusion prior is undisturbed but the latent already encodes the requested trajectory. No new parameters, no architectural change.
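The variance-preserving transport can be sketched as follows. This is an illustrative single-channel NumPy version, not Go-with-the-Flow's actual implementation: each noise value is moved along the rounded flow, collisions are summed and divided by √count (a sum of k i.i.d. N(0,1) values divided by √k is again N(0,1)), and vacated cells are refilled with fresh noise:

```python
import numpy as np

def warp_noise(noise, flow, seed=0):
    """Discrete noise transport along a 2D flow field.

    noise: (H, W) array of i.i.d. standard-normal samples.
    flow:  (H, W, 2) array of (dx, dy) displacements in pixels.
    Returns a warped noise field that is still unit-variance Gaussian.
    """
    H, W = noise.shape
    out = np.zeros_like(noise)
    count = np.zeros((H, W), dtype=np.int64)
    for y in range(H):
        for x in range(W):
            dx, dy = np.rint(flow[y, x]).astype(int)
            ty = np.clip(y + dy, 0, H - 1)
            tx = np.clip(x + dx, 0, W - 1)
            out[ty, tx] += noise[y, x]   # accumulate colliding samples
            count[ty, tx] += 1
    # Renormalize filled cells by sqrt(count); refill empty cells with new noise.
    rng = np.random.default_rng(seed)
    empty = count == 0
    out[empty] = rng.standard_normal(int(empty.sum()))
    out[~empty] /= np.sqrt(count[~empty])
    return out
```

In World-R1 this operates on the initial diffusion latent, with the flow derived from the requested camera extrinsics under the fronto-parallel assumption, so the sampler starts from noise that already "moves" the way the prompt asked.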

A pure text dataset, and periodic decoupling to keep motion alive

Training data is a synthetic Pure Text Dataset of roughly 3,000 prompts generated by Gemini, organized along the WorldScore camera-trajectory taxonomy (intra-scene, inter-scene, composite, static) and across Natural Landscapes, Urban & Architectural, Micro & Still Life, Fantasy & Surrealism, and Artistic Styles. Going text-only dissociates 3D learning from the visual biases of any particular video corpus.

Strict 3D rewards have a known failure mode: the model overfits to rigid scenes and stops producing dynamic content. World-R1 mitigates this with periodic decoupled training. Every 100 steps, R3D is suspended and the model is fine-tuned with Rgen alone on a roughly 500-prompt dynamic data subset (waterfalls, crowds, fire, transforming objects). Removing this stage actually raises reconstruction PSNR but drops VBench AVG from 85.21 to 82.64 — exactly the reward-hacking degeneracy the research team flags.
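The decoupling schedule is simple to express. A sketch, assuming the decoupled phase lasts a single step (the article says "every 100 steps"; the exact phase length and the dynamic-subset sampling logic are assumptions here):

```python
def reward_for_step(step, r3d, r_gen, period=100, lam_gen=1.0):
    """Periodic decoupled training: every `period`-th step suspends the 3D
    reward and optimizes the aesthetic reward alone (on the dynamic-prompt
    subset); all other steps use the full composite reward."""
    if step > 0 and step % period == 0:
        return r_gen
    return r3d + lam_gen * r_gen
```

The point of the schedule is that the aesthetic-only steps keep pulling the policy back toward dynamic, visually rich outputs, so the 3D reward cannot be hacked by generating near-static scenes that are trivially easy to reconstruct.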

Understanding the Results

On a 3DGS-based reconstruction protocol, World-R1-Large hits 27.67 PSNR / 0.865 SSIM / 0.162 LPIPS, against 19.76 / 0.629 / 0.405 for Wan2.1-T2V-14B — a 7.91 dB PSNR gain. World-R1-Small posts a 10.23 dB gain over its 1.3B backbone. On the reconstruction-independent Multi-View Consistency Score (MVCS) borrowed from GeoVideo, World-R1-Large reaches 0.993, ahead of all 3D-conditioned and camera-control baselines tested (Voyager, ViewCrafter, FlashWorld, ReCamMaster, etc.).

Camera control is competitive with specialized methods: RotErr 1.21, TransErr 1.30, CamMC 2.95 for the Large model, edging out CamCloneMaster and ReCamMaster despite not being a dedicated camera-control architecture. VBench scores improve over the base Wan 2.1 in Aesthetic Quality, Imaging Quality, Motion Smoothness, and Subject Consistency, with only a small regression on Background Consistency.

Two robustness results stand out for AI professionals. A dataset scaling sweep shows monotonic gains from 1K → 2K → 3K prompts on both 3D consistency and VBench AVG, suggesting the recipe is data-efficient and may scale further. And although training is on short clips, World-R1-Large generalizes to 121-frame generations, lifting PSNR from 18.32 to 26.32 over the Wan2.1-T2V-14B backbone. A 25-participant double-blind user study reports win rates of 92% for geometric consistency, 76% for camera-control accuracy, and 86% for overall preference versus Wan 2.1.

Key Takeaways

  • RL replaces architectural surgery for 3D consistency. World-R1 post-trains Wan2.1 with Flow-GRPO-Fast instead of bolting on 3D modules or training on 3D-supervised datasets. The base architecture and inference cost are unchanged.
  • The reward is analysis-by-synthesis. Each generated video is lifted to a 3D Gaussian Splatting representation via Depth Anything 3, then scored on three axes: meta-view plausibility (judged by Qwen3-VL), reconstruction fidelity (1 − LPIPS), and trajectory alignment — combined with an HPSv3 aesthetic reward to prevent quality collapse.
  • Camera control comes from noise warping, not new parameters. Motion tokens in the prompt are turned into camera extrinsics, projected to 2D optical flow, and used to warp the initial latent via Go-with-the-Flow’s discrete noise transport. No CameraCtrl-style adapter required.
  • Periodic decoupled training prevents reward hacking. Every 100 steps, the 3D reward is suspended and the model is fine-tuned with the aesthetic reward alone on ~500 dynamic prompts. Removing this stage raises PSNR but tanks VBench — the model collapses into static, easy-to-reconstruct outputs.
  • The numbers are large and hold up off-pipeline. World-R1-Large gains 7.91 dB PSNR over Wan2.1-T2V-14B, generalizes to 121-frame videos, and improves the reconstruction-independent MVCS metric — with an 86% overall preference win rate in a 25-participant blind user study.

Check out the Paper, Code, and Project Page.


The post Microsoft Research’s World-R1 Uses Flow-GRPO and 3D-Aware Rewards to Inject Geometric Consistency Into Wan 2.1 Without Architectural Changes appeared first on MarkTechPost.
