NVIDIA Releases Cosmos 3: A Two-Tower Mixture-of-Transformers Foundation Model Unifying Physical Reasoning, World Generation, and Action Generation

NVIDIA AI workforce have launched Cosmos 3. It is a household of omnimodal world fashions for bodily AI. The fashions mix bodily reasoning, world era, and motion era. All three capabilities dwell inside one open mannequin. NVIDIA open sourced the checkpoints, coaching scripts, deployment instruments, and datasets. The Cosmos 3 launch targets robotics, autonomous automobiles, and warehouse monitoring groups.

NVIDIA Cosmos 3

Physical AI techniques should perceive the world earlier than appearing in it. Robots and automobiles have to understand, predict, and then act. Earlier Cosmos releases cut up these jobs throughout separate fashions. Cosmos 3 unifies them with a Mixture-of-Transformers (MoT) structure. The structure is constructed round two towers.

The reasoner tower is a vision-language mannequin (VLM). It interprets photographs, movies, and textual content utilizing an autoregressive structure. It understands movement, object interactions, and different bodily context. NVIDIA workforce describes this tower because the mannequin’s mind.

The generator tower produces future observations and motion sequences. It makes use of a diffusion-based course of for physics-aware video and actions. These outputs are conditioned on the reasoner tower’s understanding. Information flows a method, from reasoner to generator. The reasoner can run alone. The generator at all times prompts each towers for guided era.

A single mannequin can due to this fact deal with reasoning and era collectively.

https://developer.nvidia.com/weblog/develop-physical-ai-reasoning-world-and-action-models-with-nvidia-cosmos-3

The Model Family

NVIDIA workforce describes three mannequin scales: Edge, Nano, and Super. Each makes use of the dual-tower Mixture-of-Transformers design. The two towers are initialized from pre-trained Qwen3-VL weights. That roughly doubles the parameter depend of the spine transformer.

Cosmos3-Nano is a 16B mannequin constructed on a dense 8B transformer. It adapts the Qwen3-VL 8B structure. Nano targets environment friendly inference on workstation GPUs. It runs on {hardware} just like the NVIDIA RTX PRO 6000. That fits real-time robotics and on-device bodily AI.

Cosmos3-Super is a 64B mannequin constructed on a dense 32B transformer. It adapts the Qwen3-VL 32B structure. Super targets datacenter GPUs, together with NVIDIA Hopper and Blackwell. It matches large-scale artificial information era and superior reasoning.

This launch ships Nano and Super, together with task-specific variants. These embody Super Text2Image, Super Image2Video, and Nano-Policy-DROID.

How the Unified Design Works

Both towers share one transformer structure and a joint consideration operator. They use a 3D multimodal rotary place embedding (mRoPE). mRoPE aligns video, audio, and motion tokens on one temporal axis. In Reasoner Mode, tokens move by means of causal self-attention. This allows next-token prediction for notion, planning, and reasoning. In Generator Mode, noisy tokens are denoised by means of full consideration. The autoregressive tokens are by no means up to date by the diffusion tokens.

The mannequin treats motion as a core modality with devoted motion tokens. Supported inputs embody textual content, picture, video, and JSON motion arrays. Outputs embody photographs, video, synchronized sound, motion states, and textual content. The reasoner follows Qwen3-VL-compatible message conventions for imaginative and prescient inputs.

Generation helps 256p, 480p, and 720p decision tiers. Frame counts vary from 5 to 300, defaulting to 189. That equals about 7.9 seconds of video at 24 FPS. Sound is generated as stereo AAC at 48 kHz. Action conditioning spans digicam, automobile, selfish, single-arm, dual-arm, and humanoid embodiments. Each embodiment makes use of a set motion dimension, equivalent to 9D for cameras.

The Benchmark Case

NVIDIA workforce evaluated Cosmos 3 throughout reasoning and era suites. On reasoning, Super and Nano lead VANTAGE-Bench at their respective tiers. VANTAGE-Bench assessments VLMs on real-world fixed-camera footage. It covers warehouses, transportation, and sensible areas. Cosmos 3 additionally tops the Traffic Anomaly Reasoning (TAR) leaderboard. TAR is the official leaderboard for AI City Challenge 2026 Track 3.

On era, NVIDIA reviews open-source state-of-the-art outcomes. Cosmos 3 is the open-source SOTA on R-Bench. It additionally leads PAI-Bench, Physics-IQ, and RoboLab on public leaderboards. On Artificial Analysis, it leads two open-source leaderboards. These cowl text-to-image and image-to-video with out audio.

NVIDIA workforce additionally launched its Cosmos Human Evaluation framework, referred to as HUE. HUE decomposes every generated video into sure/no reality questions. It scores 4 dimensions throughout seven bodily AI domains. The dimensions are semantic alignment, bodily legal guidelines, geometric reasoning, and visible integrity. A VLM pipeline drafts the questions, and human specialists refine them.

Marktechpost’s Visual Explainer

marktechpost@information ~ /nvidia/cosmos-3
01 / 09

DEVELOPER GUIDE · PHYSICAL AI

NVIDIA Cosmos 3

Open omnimodal world fashions for bodily AI.

Released May 31, 2026. One mannequin for bodily reasoning, world era, and motion era.

Mixture-of-Transformers
Open weights
OpenMDW-1.1

Use ← → or swipe to navigate

01 · WHAT IT IS

A unified mannequin for understanding and era

Cosmos 3 is a household of omnimodal world fashions for bodily AI. Earlier Cosmos releases cut up jobs throughout separate fashions. Cosmos 3 unifies them in a single open mannequin.

Physical reasoning over photographs, video, and textual content.
World era of physics-aware video and sound.
Action era for robots and autonomous techniques.

Subsumes VLMs, video mills, world simulators, and world-action fashions.

02 · ARCHITECTURE

Two towers, one transformer

REASONER TOWER

An autoregressive vision-language mannequin (VLM). It interprets movement, object interactions, and bodily context. NVIDIA calls it the mannequin’s mind.

GENERATOR TOWER

A diffusion-based path for physics-aware video and actions. It is conditioned on the reasoner’s understanding.

Information flows a method, reasoner → generator. Both towers share a 3D multimodal RoPE (mRoPE).

03 · MODEL FAMILY

Pick a measurement in your {hardware}

Cosmos3-Nano
16B complete (dense 8B, Qwen3-VL 8B). Workstation GPUs like RTX PRO 6000. Real-time robotics.

Cosmos3-Super
64B complete (dense 32B, Qwen3-VL 32B). Datacenter Hopper and Blackwell GPUs. Large-scale SDG.

Cosmos3-Edge
4B complete (dense 2B). On-device scale. Planned for a later launch.

Plus variants: Super-Text2Image, Super-Image2Video, and Nano-Policy-DROID.

04 · MODALITIES

Inputs, outputs, and era settings

Inputs: textual content, picture, video, and JSON motion arrays.
Outputs: picture, video, synchronized sound, motion states, textual content.
Resolution: 256p, 480p, 720p. Sound: stereo AAC at 48 kHz.
Length: 5 to 300 frames; default 189 (about 7.9s at 24 FPS).
Embodiments: digicam, automobile, selfish, single-arm, dual-arm, humanoid.

05 · BENCHMARKS

What NVIDIA reviews

REASONING

Nano and Super lead VANTAGE-Bench at their tiers. Cosmos 3 tops TAR, the AI City Challenge 2026 Track 3 leaderboard.

GENERATION

Open-source SOTA on R-Bench. Leads PAI-Bench, Physics-IQ, and RoboLab. Top open-source on Artificial Analysis text-to-image and image-to-video.

HUE evaluates movies with sure/no reality checks throughout 4 dimensions and seven domains.

06 · OPEN RELEASE

Everything ships open

Checkpoints for Nano, Super, and task-specific variants.
Six SDG datasets: robotics, physics, spatial reasoning, human movement, driving, warehouses.
Training recipes: SFT plus motion post-training.
Action modes: ahead dynamics, inverse dynamics, and coverage era.
License: OpenMDW-1.1.

07 · DEPLOYMENT

Run it in manufacturing

NIM microservices: Reasoner NIM accessible now; Generator NIM later.
Quantization: BF16, FP8, and NVFP4. NVFP4 offers as much as 2x speedup.
Serving: the Reasoner NIM stack is constructed on vLLM.
Efficient Video Sampling (EVS): prunes redundant video tokens at inference.

Use Diffusers and Transformers for analysis; vLLM-Omni and vLLM for serving.

08 · LIMITATIONS & START

Know the caveats, then construct

Outputs can present temporal inconsistency, unstable movement, object morphing, inaccurate 3D construction, and sound-video misalignment. Safety-critical management wants validation, guardrails, and system-level evaluation.

GitHubgithub.com/nvidia/cosmos

Hugging Facehuggingface.co/collections/nvidia/cosmos3

Published by Marktechpost · AI/ML analysis, fashions, and developer instruments for 1M+ readers
marktechpost.com

Key Takeaways

Cosmos 3 is NVIDIA's open household of omnimodal world fashions, unifying bodily reasoning, world era, and motion era in a single mannequin.
A two-tower Mixture-of-Transformers design pairs an autoregressive VLM reasoner with a diffusion generator, conditioned one-way from reasoner to generator.
Two checkpoints ship now: Cosmos3-Nano (16B, dense 8B spine) for workstations and Cosmos3-Super (64B, dense 32B spine) for datacenters.
NVIDIA open sourced the checkpoints, six SDG datasets, coaching recipes, and the HUE benchmark beneath the OpenMDW-1.1 license.
It reviews open-source SOTA on R-Bench and main Artificial Analysis text-to-image and image-to-video outcomes.

Check out the Model Weights, GitHub Repo, Project Page and Technical details. Also, be happy to observe us on Twitter and don’t overlook to affix our 150k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.

Need to associate with us for selling your GitHub Repo OR Hugging Face Page OR Product Release OR Webinar and so forth.? Connect with us

The put up NVIDIA Releases Cosmos 3: A Two-Tower Mixture-of-Transformers Foundation Model Unifying Physical Reasoning, World Generation, and Action Generation appeared first on MarkTechPost.

NVIDIA Releases Cosmos 3: A Two-Tower Mixture-of-Transformers Foundation Model Unifying Physical Reasoning, World Generation, and Action Generation

NVIDIA Cosmos 3

The Model Family

How the Unified Design Works

The Benchmark Case

Marktechpost’s Visual Explainer

NVIDIA Cosmos 3