NVIDIA Releases Cosmos 3: A Two-Tower Mixture-of-Transformers Foundation Model Unifying Physical Reasoning, World Generation, and Action Generation
The NVIDIA AI team has launched Cosmos 3, a family of omnimodal world models for physical AI. The models combine physical reasoning, world generation, and action generation, and all three capabilities live inside one open model. NVIDIA open sourced the checkpoints, training scripts, deployment tools, and datasets. The Cosmos 3 release targets robotics, autonomous vehicle, and warehouse monitoring teams.
NVIDIA Cosmos 3
Physical AI systems must understand the world before acting in it. Robots and vehicles have to perceive, predict, and then act. Earlier Cosmos releases split these jobs across separate models. Cosmos 3 unifies them with a Mixture-of-Transformers (MoT) architecture built around two towers.
The reasoner tower is a vision-language model (VLM). It interprets images, videos, and text using an autoregressive architecture. It understands motion, object interactions, and other physical context. The NVIDIA team describes this tower as the model's brain.
The generator tower produces future observations and action sequences. It uses a diffusion-based process for physics-aware video and actions. These outputs are conditioned on the reasoner tower's understanding. Information flows one way, from reasoner to generator. The reasoner can run alone. The generator always activates both towers for guided generation.
A single model can therefore handle reasoning and generation together.
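The one-way flow between the two towers can be illustrated with a toy sketch. Everything here is hypothetical: the class names, the fake context vector, and the averaging "denoiser" are stand-ins chosen for illustration, not NVIDIA's implementation. The sketch only shows the control flow described above: the reasoner can run alone, while generation runs both towers and never writes back into the reasoner's output.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReasonerOutput:
    text: str             # autoregressive answer or plan
    context: List[float]  # stand-in for the state handed to the generator


class ReasonerTower:
    """Toy stand-in for the autoregressive VLM tower (hypothetical)."""

    def run(self, prompt: str) -> ReasonerOutput:
        # A real tower would encode images, video, and text. Here we derive
        # a fake context vector from the first 8 prompt characters.
        context = [float(ord(c) % 7) for c in prompt[:8]]
        return ReasonerOutput(text=f"plan for: {prompt}", context=context)


class GeneratorTower:
    """Toy stand-in for the diffusion tower (hypothetical)."""

    def denoise(self, context: List[float], steps: int = 4) -> List[float]:
        # Start from a fixed "noisy" latent and pull it toward the context.
        latent = [1.0] * len(context)
        for _ in range(steps):
            latent = [0.5 * (x + c) for x, c in zip(latent, context)]
        return latent


class TwoTowerModel:
    def __init__(self) -> None:
        self.reasoner = ReasonerTower()
        self.generator = GeneratorTower()

    def reason(self, prompt: str) -> str:
        # Reasoner-only path: the generator is never touched.
        return self.reasoner.run(prompt).text

    def generate(self, prompt: str) -> List[float]:
        # Generation path: both towers run, information flows one way.
        out = self.reasoner.run(prompt)
        context_before = list(out.context)
        latent = self.generator.denoise(out.context)
        # The generator reads the context but does not modify it.
        assert out.context == context_before
        return latent


model = TwoTowerModel()
answer = model.reason("pick up cup")
latent = model.generate("pick up cup")
```

Keeping `reason` and `generate` as separate entry points mirrors the article's point that reasoning is available without paying for the generator.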

The Model Family
The NVIDIA team describes three model scales: Edge, Nano, and Super. Each uses the dual-tower Mixture-of-Transformers design. The two towers are initialized from pre-trained Qwen3-VL weights, which roughly doubles the parameter count relative to the backbone transformer.
Cosmos3-Nano is a 16B model built on a dense 8B transformer. It adapts the Qwen3-VL 8B architecture. Nano targets efficient inference on workstation GPUs and runs on hardware such as the NVIDIA RTX PRO 6000. That suits real-time robotics and on-device physical AI.
Cosmos3-Super is a 64B model built on a dense 32B transformer. It adapts the Qwen3-VL 32B architecture. Super targets datacenter GPUs, including NVIDIA Hopper and Blackwell. It suits large-scale synthetic data generation and advanced reasoning.
This release ships Nano and Super, along with task-specific variants. These include Super Text2Image, Super Image2Video, and Nano-Policy-DROID.
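The parameter counts quoted for Nano and Super follow from the two-tower design: two towers, each the size of the backbone. The short sketch below reproduces that arithmetic and adds a rough weights-only memory estimate. The 2 bytes per parameter figure assumes bf16 or fp16 storage, which the article does not state, and the estimate ignores activations, caches, and any components outside the two towers.

```python
# Backbone sizes in billions of parameters, as quoted in the article.
BACKBONE_PARAMS_B = {"Cosmos3-Nano": 8, "Cosmos3-Super": 32}
NUM_TOWERS = 2          # reasoner + generator
BYTES_PER_PARAM = 2     # assumption: 16-bit weights (bf16/fp16)


def approx_total_params_b(backbone_b: int, towers: int = NUM_TOWERS) -> int:
    """Approximate total size if every tower matches the backbone size."""
    return backbone_b * towers


def approx_weight_memory_gb(total_params_b: int) -> int:
    """Weights-only footprint in decimal GB (1e9 params * bytes / 1e9)."""
    return total_params_b * BYTES_PER_PARAM


totals = {name: approx_total_params_b(b) for name, b in BACKBONE_PARAMS_B.items()}
memory = {name: approx_weight_memory_gb(t) for name, t in totals.items()}

for name in totals:
    print(f"{name}: ~{totals[name]}B params, ~{memory[name]} GB of 16-bit weights")
```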
How the Unified Design Works
Both towers share one transformer architecture and a joint attention operator. They use a 3D multimodal rotary position embedding (mRoPE), which aligns video, audio, and action tokens on one temporal axis. In Reasoner Mode, tokens flow through causal self-attention, which enables next-token prediction for perception, planning, and reasoning. In Generator Mode, noisy tokens are denoised through full attention. The autoregressive tokens are never updated by the diffusion tokens.
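One way to picture the two attention regimes is as a single boolean mask over a sequence of autoregressive tokens followed by diffusion tokens. The sketch below is an assumption about the layout, not the released implementation: the function name `build_joint_mask` and the token ordering are hypothetical. It encodes the two properties stated above: autoregressive tokens attend causally and never see diffusion tokens, while diffusion tokens attend to everything.

```python
from typing import List


def build_joint_mask(n_ar: int, n_diff: int) -> List[List[bool]]:
    """Return mask[q][k] == True when query token q may attend to key token k.

    Tokens 0..n_ar-1 are autoregressive (reasoner) tokens.
    Tokens n_ar..n_ar+n_diff-1 are noisy diffusion (generator) tokens.
    """
    n = n_ar + n_diff
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if q < n_ar:
                # Causal self-attention: only earlier or same-position
                # autoregressive tokens, never diffusion tokens.
                mask[q][k] = k <= q
            else:
                # Full attention: diffusion tokens see the reasoner context
                # and all other diffusion tokens.
                mask[q][k] = True
    return mask


mask = build_joint_mask(n_ar=3, n_diff=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Because no autoregressive row has a `True` entry in a diffusion column, the reasoner's token states are unaffected by whatever the generator is denoising.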
The model treats action as a core modality with dedicated action tokens. Supported inputs include text, image, video, and JSON action arrays. Outputs include images, video, synchronized sound, action states, and text. The reasoner follows Qwen3-VL-compatible message conventions for vision inputs.
Generation supports 256p, 480p, and 720p resolution tiers. Frame counts range from 5 to 300, defaulting to 189, which equals about 7.9 seconds of video at 24 FPS. Sound is generated as stereo AAC at 48 kHz. Action conditioning spans camera, vehicle, egocentric, single-arm, dual-arm, and humanoid embodiments. Each embodiment uses a fixed action dimension, such as 9D for cameras.
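The generation limits above can be captured in a small validation helper. This is a sketch under stated assumptions: `GenerationRequest` and `validate_actions` are hypothetical names, not part of the Cosmos 3 tooling, and only the camera action dimension is filled in because it is the only one the article gives.

```python
from dataclasses import dataclass
from typing import List

RESOLUTION_TIERS = ("256p", "480p", "720p")
MIN_FRAMES, MAX_FRAMES, DEFAULT_FRAMES = 5, 300, 189
FPS = 24
# Only the camera dimension is stated in the article; others are left out.
ACTION_DIMS = {"camera": 9}


@dataclass
class GenerationRequest:
    """Hypothetical container for the documented generation limits."""
    resolution: str
    num_frames: int = DEFAULT_FRAMES

    def __post_init__(self) -> None:
        if self.resolution not in RESOLUTION_TIERS:
            raise ValueError(f"resolution must be one of {RESOLUTION_TIERS}")
        if not MIN_FRAMES <= self.num_frames <= MAX_FRAMES:
            raise ValueError(f"num_frames must be in [{MIN_FRAMES}, {MAX_FRAMES}]")

    @property
    def duration_seconds(self) -> float:
        return self.num_frames / FPS


def validate_actions(embodiment: str, actions: List[List[float]]) -> None:
    """Check that every per-step action vector has the embodiment's fixed size."""
    if embodiment not in ACTION_DIMS:
        raise ValueError(f"no action dimension known for {embodiment!r}")
    dim = ACTION_DIMS[embodiment]
    for step, vector in enumerate(actions):
        if len(vector) != dim:
            raise ValueError(f"step {step}: expected {dim} values, got {len(vector)}")


request = GenerationRequest(resolution="720p")
print(request.duration_seconds)  # 189 / 24 = 7.875 seconds
validate_actions("camera", [[0.0] * 9, [0.1] * 9])
```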
The Benchmark Case
The NVIDIA team evaluated Cosmos 3 across reasoning and generation suites. On reasoning, Super and Nano lead VANTAGE-Bench at their respective tiers. VANTAGE-Bench tests VLMs on real-world fixed-camera footage covering warehouses, transportation, and smart spaces. Cosmos 3 also tops the Traffic Anomaly Reasoning (TAR) leaderboard, the official leaderboard for AI City Challenge 2026 Track 3.
On generation, NVIDIA reports open-source state-of-the-art results. Cosmos 3 is the open-source SOTA on R-Bench. It also leads PAI-Bench, Physics-IQ, and RoboLab on public leaderboards. On Artificial Analysis, it leads two open-source leaderboards, covering text-to-image and image-to-video without audio.
The NVIDIA team also released its Cosmos Human Evaluation framework, called HUE. HUE decomposes each generated video into yes/no fact questions. It scores four dimensions across seven physical AI domains. The dimensions are semantic alignment, physical laws, geometric reasoning, and visual integrity. A VLM pipeline drafts the questions, and human experts refine them.
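To make the question-based scoring concrete, here is a minimal sketch of how yes/no answers could be rolled up per dimension. The aggregation rule (fraction of "yes" answers per dimension) is an assumption made for illustration; the article does not describe how HUE combines answers, and `score_video` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# The four dimensions named in the article.
DIMENSIONS = (
    "semantic_alignment",
    "physical_laws",
    "geometric_reasoning",
    "visual_integrity",
)


def score_video(answers: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Aggregate (dimension, is_yes) answers into a per-dimension yes rate.

    Assumption: each score is simply yes_count / question_count.
    Dimensions with no questions are omitted from the result.
    """
    yes_count: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for dimension, is_yes in answers:
        if dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {dimension!r}")
        total[dimension] += 1
        yes_count[dimension] += int(is_yes)
    return {d: yes_count[d] / total[d] for d in DIMENSIONS if total[d]}


example_answers = [
    ("physical_laws", True),     # e.g. "Does the box fall downward?"
    ("physical_laws", False),    # e.g. "Does the box stop at the floor?"
    ("visual_integrity", True),  # e.g. "Does the arm keep its shape?"
]
scores = score_video(example_answers)
print(scores)
```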
Key Takeaways
- Cosmos 3 is NVIDIA's open family of omnimodal world models, unifying physical reasoning, world generation, and action generation in a single model.
- A two-tower Mixture-of-Transformers design pairs an autoregressive VLM reasoner with a diffusion generator, conditioned one-way from reasoner to generator.
- Two checkpoints ship now: Cosmos3-Nano (16B, dense 8B backbone) for workstations and Cosmos3-Super (64B, dense 32B backbone) for datacenters.
- NVIDIA open sourced the checkpoints, six SDG datasets, training recipes, and the HUE benchmark under the OpenMDW-1.1 license.
- It reports open-source SOTA on R-Bench and leading Artificial Analysis text-to-image and image-to-video results.
The post NVIDIA Releases Cosmos 3: A Two-Tower Mixture-of-Transformers Foundation Model Unifying Physical Reasoning, World Generation, and Action Generation appeared first on MarkTechPost.
