MBZUAI Researchers Introduce PAN: A General World Model For Interactable Long Horizon Simulation
Most text-to-video models generate a single clip from a prompt and then stop. They do not maintain an internal world state that persists as actions arrive over time. PAN, a new model from MBZUAI's Institute of Foundation Models, is designed to fill that gap by acting as a general world model that predicts future world states as video, conditioned on history and natural language actions.

From video generator to interactive world simulator
PAN is defined as a general, interactable, long-horizon world model. It maintains an internal latent state that represents the current world, then updates that state when it receives a natural language action such as 'turn left and speed up' or 'move the robot arm to the red block.' The model then decodes the updated state into a short video segment that shows the consequence of that action. This cycle repeats, so the same world state evolves across many steps.
This design allows PAN to support open-domain, action-conditioned simulation. It can roll out counterfactual futures for different action sequences. An external agent can query PAN as a simulator, compare predicted futures, and choose actions based on those predictions, as sketched below.
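To make the interaction pattern concrete, here is a minimal Python sketch of an agent driving a persistent world model step by step. The `PanWorldModel` class, its methods, and the placeholder clip strings are hypothetical stand-ins for illustration, not PAN's released API.

```python
from dataclasses import dataclass

# Hypothetical interface, not PAN's released API.
@dataclass
class WorldState:
    latent: list  # stand-in for the latent world-state tensor

class PanWorldModel:
    def encode(self, frames) -> WorldState:
        """Map observed frames to an initial latent world state."""
        return WorldState(latent=[frames])

    def step(self, state: WorldState, action: str):
        """Predict the next latent state for a language action, then
        decode it into a short video segment."""
        next_state = WorldState(latent=state.latent + [action])
        video_segment = f"<clip showing: {action}>"
        return next_state, video_segment

model = PanWorldModel()
state = model.encode(frames=["initial observation"])
for action in ["turn left and speed up", "move the robot arm to the red block"]:
    state, clip = model.step(state, action)
    print(clip)  # each step emits a short clip while the latent state persists
```

The key property this loop illustrates is that the latent state, not the rendered video, carries the world forward between steps, so an agent can branch and compare alternative action sequences from the same state.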
GLP architecture, separating what happens from how it looks
The base of PAN is the Generative Latent Prediction (GLP) architecture. GLP separates world dynamics from visual rendering. First, a vision encoder maps images or video frames into a latent world state. Second, an autoregressive latent dynamics backbone based on a large language model predicts the next latent state, conditioned on history and the current action. Third, a video diffusion decoder reconstructs the corresponding video segment from that latent state.
In PAN, the vision encoder and backbone are built on Qwen2.5-VL-7B-Instruct. The vision tower tokenizes frames into patches and produces structured embeddings. The language backbone runs over a history of world states and actions, plus learned query tokens, and outputs the latent representation of the next world state. These latents live in the shared multimodal space of the VLM, which helps ground the dynamics in both text and vision.
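To illustrate that readout mechanism, here is a rough, self-contained PyTorch sketch of a GLP-style dynamics step: history latents, action embeddings, and learned query tokens pass through a transformer, and the query positions are taken as the predicted next latent state. The module sizes and the plain `TransformerEncoder` standing in for the frozen Qwen2.5-VL backbone are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class GLPDynamics(nn.Module):
    """Illustrative latent-dynamics step: history + action -> next latent."""
    def __init__(self, d_model=256, n_queries=8):
        super().__init__()
        # Toy stand-in for the frozen Qwen2.5-VL language backbone.
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        # Learned query tokens that read out the next world state.
        self.queries = nn.Parameter(torch.randn(1, n_queries, d_model))

    def forward(self, history_latents, action_embeds):
        b = history_latents.size(0)
        queries = self.queries.expand(b, -1, -1)
        # Concatenate world-state history, action text, and query tokens.
        seq = torch.cat([history_latents, action_embeds, queries], dim=1)
        out = self.backbone(seq)
        # The query positions carry the predicted next latent state.
        return out[:, -queries.size(1):, :]

dyn = GLPDynamics()
history = torch.randn(2, 16, 256)   # encoded past observations and actions
action = torch.randn(2, 4, 256)     # embedded current language action
next_latent = dyn(history, action)  # (2, 8, 256), fed to the video decoder
```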
The video diffusion decoder is adapted from Wan2.1-T2V-14B, a diffusion transformer for high-fidelity video generation. The research team trains this decoder with a flow matching objective, using one thousand denoising steps and a Rectified Flow formulation. The decoder conditions on both the predicted latent world state and the current natural language action, with a dedicated cross-attention stream for the world state and another for the action text.
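The Rectified Flow objective mentioned above trains the decoder to predict the straight-line velocity from noise to data. Below is a minimal sketch of that loss with a toy linear network standing in for the conditioned video decoder; the world-state and action conditioning streams are omitted for brevity.

```python
import torch
import torch.nn as nn

# Toy stand-in for the Wan2.1-based decoder; inputs are (noisy latent, time).
decoder = nn.Linear(64 + 1, 64)

def rectified_flow_loss(x1):
    """Flow matching on the Rectified Flow path x_t = (1 - t) * x0 + t * x1:
    the target velocity is the constant v = x1 - x0, and the network
    predicts it from the noisy point x_t and the time t."""
    x0 = torch.randn_like(x1)                     # pure noise endpoint
    t = torch.rand(x1.size(0), 1)                 # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1                    # point on the straight path
    v_target = x1 - x0                            # straight-line velocity
    v_pred = decoder(torch.cat([xt, t], dim=-1))  # conditioning omitted here
    return ((v_pred - v_target) ** 2).mean()

x1 = torch.randn(8, 64)  # stand-in for clean video latents
loss = rectified_flow_loss(x1)
loss.backward()
```

At sampling time, the learned velocity field is integrated from noise toward data over a discretized schedule, which the article reports as one thousand steps.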

Causal Swin-DPM and sliding window diffusion
Naively chaining single-shot video models by conditioning only on the last frame leads to local discontinuities and rapid quality degradation over long rollouts. PAN addresses this with Causal Swin-DPM, which augments the Shift Window Denoising Process Model with chunk-wise causal attention.
The decoder operates on a sliding temporal window that holds two chunks of video frames at different noise levels. During denoising, one chunk moves from high noise to clean frames and then leaves the window. A new noisy chunk enters at the other end. Chunk-wise causal attention ensures that the later chunk can only attend to the earlier one, not to unseen future actions. This keeps transitions between chunks smooth and reduces error accumulation over long horizons.
PAN also adds controlled noise to the conditioning frame, rather than using a perfectly sharp frame. This suppresses incidental pixel details that do not matter for dynamics and encourages the model to focus on stable structure such as objects and layout.
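Based on the description above, a toy sketch of the two-chunk sliding window and the chunk-wise causal mask might look like the following; the actual chunk sizes, noise schedule, and mask layout in PAN may differ.

```python
import torch

FRAMES_PER_CHUNK = 4

def chunkwise_causal_mask(n_chunks=2, frames=FRAMES_PER_CHUNK):
    """mask[q, k] is True where query frame q may attend to key frame k:
    each chunk sees itself and earlier chunks, never the later, noisier one."""
    n = n_chunks * frames
    mask = torch.zeros(n, n, dtype=torch.bool)
    for q in range(n):
        for k in range(n):
            mask[q, k] = (k // frames) <= (q // frames)
    return mask

def sliding_window_rollout(n_steps, denoise_chunk, sample_noise):
    """Two chunks live in the window at different noise levels. Each step,
    the earlier chunk finishes denoising and leaves; a fresh noisy chunk
    enters at the other end. (In PAN, the conditioning frames are also
    kept slightly noised rather than perfectly sharp.)"""
    window = [sample_noise(), sample_noise()]
    video = []
    for _ in range(n_steps):
        mask = chunkwise_causal_mask()
        clean = denoise_chunk(window, mask)   # earlier chunk -> clean frames
        video.append(clean)                   # emit the finished chunk
        window = [window[1], sample_noise()]  # slide window, admit new noise
    return video

# Toy denoiser and noise sampler so the sketch runs end to end.
frames = sliding_window_rollout(
    n_steps=3,
    denoise_chunk=lambda w, m: w[0] * 0,      # pretend denoiser yields clean frames
    sample_noise=lambda: torch.randn(FRAMES_PER_CHUNK, 8),
)
print(len(frames), frames[0].shape)  # 3 chunks of clean frames
```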

Training stack and data construction
PAN is trained in two stages. In the first stage, the research team adapts Wan2.1-T2V-14B into the Causal Swin-DPM architecture. They train the decoder in BFloat16 with AdamW, a cosine schedule, gradient clipping, FlashAttention-3 and FlexAttention kernels, and a hybrid sharded data parallel scheme across 960 NVIDIA H200 GPUs.
In the second stage, they integrate the frozen Qwen2.5-VL-7B-Instruct backbone with the video diffusion decoder under the GLP objective. The vision-language model stays frozen. The model learns query embeddings and the decoder so that predicted latents and reconstructed videos stay consistent. This joint training also uses sequence parallelism and Ulysses-style attention sharding to handle long-context sequences. Early stopping ends training after 1 epoch once validation converges, although the schedule allows 5 epochs.
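A schematic of the stage-one optimizer setup as described, with BFloat16 autocast, AdamW, a cosine schedule, and gradient clipping. All hyperparameter values are placeholders rather than the paper's, and the real run shards a 14B-parameter decoder across 960 H200 GPUs rather than training a toy module on CPU.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

decoder = torch.nn.Linear(64, 64)  # toy stand-in for the Wan2.1-based decoder

optimizer = AdamW(decoder.parameters(), lr=1e-4, weight_decay=0.01)  # placeholders
scheduler = CosineAnnealingLR(optimizer, T_max=10_000)               # cosine decay

for step in range(100):  # placeholder loop; no data sharding shown here
    with torch.autocast("cpu", dtype=torch.bfloat16):  # BF16 compute ("cuda" in practice)
        loss = decoder(torch.randn(8, 64)).pow(2).mean()  # dummy loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(decoder.parameters(), max_norm=1.0)  # grad clipping
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```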
Training data comes from widely used, publicly available video sources that cover everyday activities, human-object interactions, natural environments, and multi-agent scenarios. Long-form videos are segmented into coherent clips using shot boundary detection. A filtering pipeline removes static or overly dynamic clips, low aesthetic quality, heavy text overlays, and screen recordings using rule-based metrics, pretrained detectors, and a custom VLM filter. The research team then re-captions clips with dense, temporally grounded descriptions that emphasize motion and causal events.
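The curation step reads like a chain of filters followed by recaptioning. Here is a simplified sketch with hypothetical thresholds and precomputed scores; the paper's actual detectors and cutoffs are not specified here.

```python
def keep_clip(clip):
    """Hypothetical filter chain mirroring the described pipeline."""
    if clip["motion_score"] < 0.05:      # static clip
        return False
    if clip["motion_score"] > 0.95:      # overly dynamic or jittery
        return False
    if clip["aesthetic_score"] < 0.4:    # low aesthetic quality
        return False
    if clip["text_area_ratio"] > 0.2:    # heavy text overlays
        return False
    if clip["is_screen_recording"]:      # screen captures
        return False
    return True

clips = [
    {"motion_score": 0.5, "aesthetic_score": 0.8,
     "text_area_ratio": 0.0, "is_screen_recording": False},
    {"motion_score": 0.01, "aesthetic_score": 0.9,
     "text_area_ratio": 0.0, "is_screen_recording": False},
]
kept = [c for c in clips if keep_clip(c)]
print(len(kept))  # 1 -- the static clip is dropped
# Surviving clips would then be re-captioned by a VLM with dense,
# temporally grounded descriptions of motion and causal events.
```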
Benchmarks, action fidelity, long-horizon stability, planning
The research team evaluates the model along three axes, action simulation fidelity, long-horizon forecast, and simulative reasoning and planning, against both open-source and commercial video generators and world models. Baselines include WAN 2.1 and 2.2, Cosmos 1 and 2, V-JEPA 2, and commercial systems such as KLING, MiniMax Hailuo, and Gen-3.
For action simulation fidelity, a VLM-based judge scores how well the model executes language-specified actions while maintaining a stable background. PAN reaches 70.3% accuracy on agent simulation and 47% on environment simulation, for an overall score of 58.6%. It achieves the highest fidelity among open-source models and surpasses most commercial baselines.
For long-horizon forecast, the research team measures Transition Smoothness and Simulation Consistency. Transition Smoothness uses optical flow acceleration to quantify how smooth motion is across action boundaries. Simulation Consistency uses metrics inspired by WorldScore to monitor degradation over extended sequences. PAN scores 53.6% on Transition Smoothness and 64.1% on Simulation Consistency and exceeds all baselines, including KLING and MiniMax, on these metrics.
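Transition Smoothness, as described, derives from optical flow acceleration, the second temporal difference of the flow field. A rough sketch of such a metric follows; the benchmark's exact normalization and scoring are assumptions, not taken from the paper.

```python
import numpy as np

def transition_smoothness(flows):
    """flows: list of per-frame optical flow fields, each of shape (H, W, 2).
    Acceleration is the second temporal difference of the flow; smoother
    motion across action boundaries yields a lower mean acceleration."""
    flows = np.stack(flows)              # (T, H, W, 2)
    accel = np.diff(flows, n=2, axis=0)  # second difference over time
    return float(np.linalg.norm(accel, axis=-1).mean())

# Toy flows: constant motion -> zero acceleration -> perfectly smooth.
flows = [np.full((4, 4, 2), 1.0) for _ in range(5)]
print(transition_smoothness(flows))  # 0.0
```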
For simulative reasoning and planning, PAN is used as an internal simulator inside an OpenAI o3-based agent loop. In step-wise simulation, PAN achieves 56.1% accuracy, the best among open-source world models.
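In an agent loop of this kind, the world model acts as lookahead: simulate each candidate action sequence, score the predicted futures, and commit to the best one. The toy simulator and scorer below are stand-ins, not the OpenAI o3 agent or PAN itself.

```python
def plan(simulate, score, state, candidate_plans):
    """Pick the action sequence whose simulated future scores best."""
    best_plan, best_value = None, float("-inf")
    for actions in candidate_plans:
        s, rollout = state, []
        for action in actions:
            s, segment = simulate(s, action)  # world model predicts the future
            rollout.append(segment)
        value = score(rollout)                # external judge rates the rollout
        if value > best_value:
            best_plan, best_value = actions, value
    return best_plan

# Stand-in simulator and scorer so the sketch runs.
best = plan(
    simulate=lambda s, a: (s + [a], f"<clip: {a}>"),
    score=lambda rollout: -len(rollout[-1]),  # toy preference, not a real judge
    state=[],
    candidate_plans=[["go left"], ["go right", "pick up block"]],
)
print(best)
```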

Key Takeaways
- PAN implements the Generative Latent Prediction architecture, combining a Qwen2.5-VL-7B based latent dynamics backbone with a Wan2.1-T2V-14B based video diffusion decoder, to unify latent world reasoning and realistic video generation.
- The Causal Swin-DPM mechanism introduces a sliding-window, chunk-wise causal denoising process that conditions on partially noised previous chunks, which stabilizes long-horizon video rollouts and reduces temporal drift compared to naive last-frame conditioning.
- PAN is trained in two stages, first adapting the Wan2.1 decoder to Causal Swin-DPM on 960 NVIDIA H200 GPUs with a flow matching objective, then jointly training the GLP stack with a frozen Qwen2.5-VL backbone and learned query embeddings plus the decoder.
- The training corpus consists of large-scale video-action pairs from diverse domains, processed with segmentation, filtering, and dense temporal recaptioning, enabling PAN to learn action-conditioned, long-range dynamics instead of isolated short clips.
- PAN achieves state-of-the-art open-source results on action simulation fidelity, long-horizon forecasting, and simulative planning, with reported scores such as 70.3% agent simulation, 47% environment simulation, 53.6% transition smoothness, and 64.1% simulation consistency, while remaining competitive with leading commercial systems.
Comparison Table
| Dimension | PAN | Cosmos video2world WFM | Wan2.1 T2V 14B | V-JEPA 2 |
|---|---|---|---|---|
| Organization | MBZUAI Institute of Foundation Models | NVIDIA Research | Wan AI and Open Laboratory | Meta AI |
| Primary role | General world model for interactive, long-horizon world simulation with natural language actions | World foundation model platform for Physical AI with video-to-world generation for control and navigation | High-quality text-to-video and image-to-video generator for general content creation and editing | Self-supervised video model for understanding, prediction and planning tasks |
| World model framing | Explicit GLP world model with latent state, action, and next observation defined, focuses on simulative reasoning and planning | Described as a world foundation model that generates future video worlds from past video and a control prompt, aimed at Physical AI, robotics, driving, navigation | Framed as a video generation model, not primarily as a world model, no persistent internal world state described in docs | Joint embedding predictive architecture for video, focuses on latent prediction rather than explicit generative supervision in observation space |
| Core architecture | GLP stack, vision encoder from Qwen2.5-VL-7B, LLM-based latent dynamics backbone, video diffusion decoder with Causal Swin-DPM | Family of diffusion-based and autoregressive world models with video2world generation, plus a diffusion decoder and a prompt upsampler based on a language model | Spatio-temporal variational autoencoder and diffusion transformer T2V model at 14 billion parameters, supports multiple generative tasks and resolutions | JEPA-style encoder plus predictor architecture that matches latent representations of consecutive video observations |
| Backbone and latent space | Multimodal latent space from Qwen2.5-VL-7B, used both for encoding observations and for autoregressive latent prediction under actions | Token-based video2world model with text prompt conditioning and an optional diffusion decoder for refinement, latent space details depend on model variant | Latent space from the VAE plus diffusion transformer, driven primarily by text or image prompts, no explicit agent action sequence interface | Latent space built from a self-supervised video encoder with a predictive loss in representation space, not a generative reconstruction loss |
| Action or control input | Natural language actions in dialogue format, applied at each simulation step, model predicts the next latent state and decodes video conditioned on action and history | Control input as a text prompt and optionally camera pose for navigation and downstream tasks such as humanoid control and autonomous driving | Text prompts and image inputs for content control, no explicit multi-step agent action interface described as world model control | Does not handle natural language actions, used more as a visual representation and predictor module inside larger agents or planners |
| Long horizon design | Causal Swin-DPM sliding window diffusion, chunk-wise causal attention, conditioning on a slightly noised last frame to reduce drift and maintain stable long-horizon rollouts | Video2world model generates future video given a past window and a prompt, supports navigation and long sequences, but the paper does not describe a Causal Swin-DPM style mechanism | Can generate several seconds at 480P and 720P, focuses on visual quality and motion, long-horizon stability is evaluated via Wan-Bench but without an explicit world state mechanism | Long temporal reasoning comes from predictive latent modeling and self-supervised training, not from generative video rollouts with explicit diffusion windows |
| Training data focus | Large-scale video-action pairs across diverse physical and embodied domains, with segmentation, filtering and dense temporal recaptioning for action-conditioned dynamics | Mix of proprietary and public Internet videos focused on Physical AI categories such as driving, manipulation, human activity, navigation and nature dynamics, with a dedicated curation pipeline | Large open-domain video and image corpora for general visual generation, with Wan-Bench evaluation prompts, not targeted specifically at agent-environment rollouts | Large-scale unlabelled video data for self-supervised representation learning and prediction, details in the V-JEPA 2 paper |
Editorial Comments
PAN is an important step because it operationalizes Generative Latent Prediction with production-scale components such as Qwen2.5-VL-7B and Wan2.1-T2V-14B, then validates this stack on well-defined benchmarks for action simulation, long-horizon forecasting, and simulative planning. The training and evaluation pipeline is clearly documented by the research team, the metrics are reproducible, and the model is released within a transparent world modeling framework rather than as an opaque video demo. Overall, PAN shows how a vision-language backbone plus a diffusion video decoder can function as a practical world model instead of a pure generative toy.
Check out the Paper, Technical details and Project.
