Meet Qwen-RobotSuite: Three Embodied AI Models for VLA Manipulation, Video World Modeling, and Navigation
The Qwen team has released three embodied AI models, grouped as Qwen-Robot-Suite. The three are Qwen-RobotManip, Qwen-RobotWorld, and Qwen-RobotNav. Each is built on a Qwen vision-language backbone and targets a distinct robotics problem.
Qwen-RobotManip is a Vision-Language-Action model for manipulation, built on Qwen3.5-4B. Qwen-RobotWorld is a language-conditioned video world model with a 60-layer MMDiT and a frozen Qwen2.5-VL encoder. Qwen-RobotNav is a navigation model built on Qwen3-VL, available at 2B, 4B, and 8B sizes.
Qwen-Robot-Suite
Qwen-Robot-Suite is not a single model. It is a set of three independent foundation models. Two of them, RobotManip and RobotNav, ship with public GitHub repositories.
Robotics data is fragmented across hardware and tasks. Different robots use incompatible observation and action formats. A policy trained on one arm rarely transfers to another.
The three research reports address this fragmentation in different ways. RobotManip aligns action representations so manipulation data scales. RobotWorld uses language as a unified action interface for video prediction. RobotNav exposes a controllable observation interface for navigation tasks.
Here is the core split between the three releases:
| Model | Problem | Backbone | Output |
|---|---|---|---|
| Qwen-RobotManip | Robotic manipulation | Qwen3.5-4B (Qwen-VL) | Continuous robotic actions |
| Qwen-RobotWorld | Embodied world modeling | Frozen Qwen2.5-VL | Predicted future video |
| Qwen-RobotNav | Mobile navigation | Qwen3-VL (2B/4B/8B) | Waypoint trajectories |
Qwen-RobotManip: Alignment Unlocks Scale for Manipulation
Qwen-RobotManip is a Vision-Language-Action (VLA) foundation model. It is built on Qwen-VL and predicts continuous robot actions.
A VLA model takes camera views and a language instruction, then outputs low-level robot actions. The challenge is that manipulation data is heterogeneous by nature.
Different robots report states and actions in incompatible formats. When demonstrations arrive with mismatched representations, scaling the data produces interference rather than improvement. RobotManip addresses this with a unified alignment framework.
The Unified Alignment Framework
The framework has three complementary mechanisms. First is a canonical state-action representation. It is an 80-dimensional vector with per-dimension binary masking.
This vector holds two 29-dimensional per-arm blocks plus 22 reserved dimensions. Each block stores joint positions, end-effector pose, gripper state, and dexterous hand joints. Robots populate only the dimensions they have.
Second is a camera-frame delta pose parameterization. End-effector actions are expressed as deltas in the camera frame. This makes visually similar motions numerically close across embodiments.
Third is an in-context policy adaptation mechanism. It reads recent execution history as an implicit embodiment identifier. The policy adjusts its behavior at deployment time without parameter updates.
A dual-stream co-training strategy runs alongside this. It jointly optimizes on manipulation data and a vision-language stream. This prevents the backbone's perception and reasoning from eroding.
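The camera-frame idea can be illustrated with a small sketch. It assumes each robot knows a rotation from its base frame to the camera frame; the function names, the two example rotations, and the restriction to translation deltas (rotation deltas are left out) are all illustrative assumptions, not details from the report.

```python
def rotate(R, v):
    """Apply a 3x3 rotation matrix (nested lists) to a 3-vector."""
    return [sum(R[i][j] * v[j] for j in range(3)) for i in range(3)]


def base_delta_to_camera(R_cam_from_base, delta_base):
    """Re-express an end-effector translation delta in the camera frame."""
    return rotate(R_cam_from_base, delta_base)


# Hypothetical robot A: base frame already aligned with the camera frame.
R_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Hypothetical robot B: base frame rotated 90 degrees about z relative to the camera.
R_b = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

# The same visual motion (5 cm to the camera's right) has different
# base-frame coordinates on each robot...
delta_a = [0.05, 0.0, 0.0]
delta_b = [0.0, -0.05, 0.0]

# ...but identical coordinates once expressed in the camera frame.
cam_a = base_delta_to_camera(R_a, delta_a)
cam_b = base_delta_to_camera(R_b, delta_b)
print(cam_a, cam_b)
```

In this toy case the two base-frame deltas differ numerically, while their camera-frame versions coincide, which is the property the parameterization is meant to provide.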
The Data Engine
RobotManip assembles roughly 38,100 hours of manipulation data. It uses only open-source datasets and human videos. No proprietary data collection was used.
A human-to-robot synthesis pipeline produces most of this scale. It converts egocentric hand demonstrations into robot trajectories. The pipeline renders across 15 robot platforms.
This synthesis alone yields about 24,808 hours of demonstrations. The egocentric source data is about 1,933 hours. Open-source robot datasets contribute over 11,000 hours.
The pipeline separates action alignment from visual alignment. Action alignment retargets hand keypoints to gripper poses. Visual alignment uses SAM3 masking, ProPainter inpainting, and MuJoCo inverse kinematics.
A five-stage curation pipeline then filters the combined corpus. It catches sudden changes, temporal misalignment, and extreme values. One check found that 81% of episodes in a subset failed state-action alignment.
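As a rough illustration of what such filters might look like, the sketch below flags two of the failure types named above, extreme values and sudden changes. The function name, the thresholds, and the list-of-lists episode format are assumptions made for this example; the report's actual five stages are not reproduced here, and the temporal-misalignment check is omitted.

```python
def flag_episode(states, max_abs=10.0, max_jump=1.0):
    """Return the reasons an episode looks suspect (empty list if clean).

    states: list of per-timestep state vectors (lists of floats).
    max_abs, max_jump: hypothetical thresholds, chosen only for the demo.
    """
    reasons = []
    # Extreme values: any entry far outside a plausible range.
    if any(abs(x) > max_abs for frame in states for x in frame):
        reasons.append("extreme_value")
    # Sudden changes: a large jump between consecutive timesteps.
    for prev, cur in zip(states, states[1:]):
        if any(abs(c - p) > max_jump for p, c in zip(prev, cur)):
            reasons.append("sudden_change")
            break
    return reasons


clean = [[0.0, 0.1], [0.1, 0.2], [0.2, 0.3]]
jumpy = [[0.0, 0.1], [5.0, 0.2], [5.1, 0.3]]
print(flag_episode(clean))  # []
print(flag_episode(jumpy))  # ['sudden_change']
```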
Benchmark Results
The research report argues that standard benchmarks fail to measure generalization. Models without robot pretraining match pretrained ones on in-distribution tests. RobotManip therefore focuses on out-of-distribution (OOD) settings.
| Benchmark (OOD) | Prev. SOTA (π0.5) | Qwen-RobotManip |
|---|---|---|
| LIBERO-Plus | 84.4 | 91.4 |
| RoboTwin-C2R Hard | 47.9 | 69.4 |
| EBench | 27.1 | 45.6 |
| RoboCasa365 | 16.9 | 35.9 |
| RoboTwin-IF | 49.6 | 72.2 |
The largest reported gap is on cross-embodiment transfer. RobotManip reaches 23.9% using camera-frame EEF actions. That is 3.2× the 7.5% achieved by π0.5.
The model also ranks 1st on the RoboChallenge Table30-v1 generalist track, with a 20% relative improvement over the prior best. Real-robot validation covers AgileX ALOHA, Franka, UR, and ARX platforms.
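The figures in this section can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in the table and text above; the variable names are arbitrary.

```python
# (previous SOTA, Qwen-RobotManip) scores as listed in the OOD table.
results = {
    "LIBERO-Plus": (84.4, 91.4),
    "RoboTwin-C2R Hard": (47.9, 69.4),
    "EBench": (27.1, 45.6),
    "RoboCasa365": (16.9, 35.9),
    "RoboTwin-IF": (49.6, 72.2),
}

# Absolute gain in points on each benchmark.
gains = {name: round(new - old, 1) for name, (old, new) in results.items()}
for name, gain in gains.items():
    print(f"{name}: +{gain}")

# Cross-embodiment transfer: 23.9% versus 7.5%.
transfer_ratio = 23.9 / 7.5
print(round(transfer_ratio, 1))  # 3.2
```

The absolute gains range from 7.0 points on LIBERO-Plus to 22.6 points on RoboTwin-IF, and the transfer ratio rounds to the 3.2× quoted above.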
Qwen-RobotWorld: Language as a Universal Action Interface
Qwen-RobotWorld is a language-conditioned video world model. It predicts future visual trajectories from a current observation. Natural language serves as the unified action interface.
A world model learns environment dynamics. Given a current state and an action, it predicts the next state. RobotWorld represents states as video frames and actions as text.
This matters because language is embodiment-agnostic. One instruction encodes the action sequence, goal, and constraints. It works across a Franka gripper, an ALOHA dual-arm system, or a humanoid.
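The state-plus-action-to-next-state interface can be sketched with a toy example. Everything here is a stand-in: the "frame" is a tiny grid, the "model" is a hand-written rule, and the names `toy_world_model` and `rollout` are hypothetical. The real system is a diffusion model that generates video; the sketch only shows the shape of the interface, with text as the action.

```python
from typing import Callable, List

Frame = List[List[int]]                      # toy stand-in for an image
WorldModel = Callable[[Frame, str], Frame]   # (current frame, instruction) -> next frame


def toy_world_model(frame: Frame, instruction: str) -> Frame:
    """Toy dynamics: shift every row one column left or right (with wrap-around)."""
    step = {"move right": 1, "move left": -1}.get(instruction, 0)
    width = len(frame[0])
    return [[row[(c - step) % width] for c in range(width)] for row in frame]


def rollout(model: WorldModel, frame: Frame, instructions: List[str]) -> List[Frame]:
    """Predict a trajectory by feeding each predicted frame back into the model."""
    frames = [frame]
    for text in instructions:
        frames.append(model(frames[-1], text))
    return frames


start = [[1, 0, 0, 0]]
frames = rollout(toy_world_model, start, ["move right", "move right", "move left"])
print(frames[-1])  # [[0, 1, 0, 0]]
```

Because the action is plain text, the same `rollout` loop would work unchanged for any embodiment the model understands, which is the point of using language as the interface.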
The Double-Stream MMDiT Architecture
The model uses a 60-layer double-stream Multimodal Diffusion Transformer. An understanding stream processes features from a frozen Qwen2.5-VL encoder. A generation stream processes video-VAE latents.
The two streams interact via joint attention at every layer. Using an MLLM as the action encoder gives two advantages: it parses compositional instructions, and it constrains the model to physically plausible transitions.
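A minimal sketch of joint attention between two token streams follows. It is heavily simplified: a single head, no learned query/key/value projections (tokens serve as their own queries, keys, and values), and pure-Python lists instead of tensors. The function names and toy inputs are assumptions for illustration; a real double-stream block would keep separate weights per stream and attend over the concatenated sequence.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def joint_attention(text_tokens, video_tokens):
    """Single-head attention over the concatenation of both streams.

    Every token attends to every token from both streams, then the
    result is split back into a text part and a video part.
    """
    tokens = text_tokens + video_tokens
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, tokens)) for d in range(dim)])
    n_text = len(text_tokens)
    return out[:n_text], out[n_text:]


text = [[1.0, 0.0], [0.0, 1.0]]                  # toy "understanding" tokens
video = [[1.0, 1.0], [0.5, -0.5], [0.0, 2.0]]    # toy "generation" tokens
text_out, video_out = joint_attention(text, video)
print(len(text_out), len(video_out))  # 2 3
```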
The MMDiT has 20B parameters. The VAE adopts the Wan-VAE architecture. The context length supports up to 48,360 video tokens.
A Scene2Robot mechanism reuses this backbone for cross-embodiment synthesis. It processes scene, robot reference, and generation segments jointly. This enables human-to-robot video transfer without robot-specific prompting.
The Embodied World Knowledge Dataset
Training uses the Embodied World Knowledge (EWK) dataset. It contains roughly 8.6M video-text pairs, spanning over 200M observation frames.
The corpus covers four embodied domains plus general video. Manipulation provides about 5.9M samples across 20+ morphologies. Driving, navigation, and human-to-robot transfer fill out the rest.
An action-language mapping framework standardizes everything. It converts 20+ embodiment types and 500+ action categories into language. A hierarchical five-layer annotation pipeline produces the captions.
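To make the idea of mapping structured actions to language concrete, here is a minimal template-based sketch. The category names, templates, and `action_to_language` function are hypothetical; the report describes a hierarchical annotation pipeline, which this fixed-template lookup does not attempt to reproduce.

```python
# Hypothetical action categories and sentence templates.
TEMPLATES = {
    "pick_place": "The {embodiment} picks up the {obj} and places it on the {target}.",
    "push": "The {embodiment} pushes the {obj} toward the {target}.",
    "turn_left": "The {embodiment} turns left at the {target}.",
}


def action_to_language(category, **slots):
    """Render a structured action as a natural-language caption."""
    return TEMPLATES[category].format(**slots)


caption_arm = action_to_language(
    "pick_place", embodiment="Franka arm", obj="red cup", target="shelf"
)
caption_car = action_to_language(
    "turn_left", embodiment="vehicle", target="intersection"
)
print(caption_arm)
print(caption_car)
```

Both a manipulator and a vehicle end up described in the same modality, so a single text-conditioned model can consume either.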
Benchmark Results
RobotWorld was evaluated on four established benchmarks. It ranks 1st overall on two of them:
| Benchmark | Result | Ranking |
|---|---|---|
| EWMBench | 4.60 | 1st overall |
| DreamGen Bench | 4.952 | 1st overall |
| WorldModelBench | 8.99 | 1st open-source (3rd overall) |
| PBench | 0.804 | 1st open-source |
On EWMBench it leads on motion fidelity with an HSD of 0.566. That is a 33% gain over the runner-up. Scene consistency reaches 0.914.
On WorldModelBench it scores 1.00 on four physics-adherence categories: Newton's laws, mass conservation, fluid dynamics, and gravity. Penetration scores 0.94, and instruction following scores 2.33 out of 3.0.
Qwen-RobotNav: A Controllable Interface for Navigation
Qwen-RobotNav is a scalable navigation model built on Qwen3-VL. It reframes multi-task navigation as observation context modeling. The model exposes a parameterized interface for external control.
Navigation spans many task families. Instruction following, point-goal navigation, object search, target tracking, and driving all differ. Each demands a different strategy for consuming the visual stream.
Instruction following needs long memory to re-reference landmarks. Target tracking needs only the most recent frames. No fixed context strategy serves all tasks well.
The Parameterized Interface
RobotNav formulates all tasks as waypoint trajectory prediction. It predicts 8 waypoints, each with a 2D position and heading. A lightweight 4-layer MLP head produces these from the backbone.
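The output format can be pictured as a small data structure. The sketch below assumes the head emits a flat vector of 8 × 3 values ordered as (x, y, heading) per waypoint; that layout, the units, and the names `Waypoint` and `decode_waypoints` are illustrative assumptions rather than the released API.

```python
from dataclasses import dataclass

NUM_WAYPOINTS = 8  # per the report: 8 waypoints per prediction


@dataclass
class Waypoint:
    x: float        # assumed: metres in the robot frame
    y: float
    heading: float  # assumed: radians


def decode_waypoints(flat):
    """Split a flat [x0, y0, h0, x1, y1, h1, ...] vector into Waypoint objects."""
    if len(flat) != NUM_WAYPOINTS * 3:
        raise ValueError(f"expected {NUM_WAYPOINTS * 3} values, got {len(flat)}")
    return [Waypoint(*flat[i:i + 3]) for i in range(0, len(flat), 3)]


# Toy prediction: drive straight ahead in 0.25 m steps.
flat = []
for k in range(1, NUM_WAYPOINTS + 1):
    flat += [0.25 * k, 0.0, 0.0]

path = decode_waypoints(flat)
print(len(path), path[-1])  # 8 Waypoint(x=2.0, y=0.0, heading=0.0)
```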
The interface has two configuration dimensions. Task modes select navigation behavior across VLN, PointNav, ObjNav, and Tracking. Observation parameters govern how visual history is encoded.
These observation controls include a visual token budget and temporal decay. They also include per-camera importance weights. Training-time randomization over all parameters ensures robustness.
Camera identity and temporal order are encoded with natural-language tags. This requires zero architectural modification to Qwen3-VL. Supporting a new platform needs only a new prompt template.
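One way to picture these controls is as a budget-splitting rule. The sketch below divides a visual token budget across frames and cameras using an exponential temporal decay and per-camera weights, and builds a text tag for each view. The weighting formula, the tag wording, and the function names are assumptions for illustration; the report does not specify them in this form.

```python
def allocate_tokens(num_frames, cameras, budget, decay=0.5):
    """Split a visual token budget across (frame age, camera) pairs.

    cameras: dict mapping camera name -> importance weight.
    Age 0 is the newest frame; older frames are down-weighted by decay**age.
    """
    raw = {
        (age, cam): weight * decay ** age
        for age in range(num_frames)
        for cam, weight in cameras.items()
    }
    total = sum(raw.values())
    # Floor each share so the allocation never exceeds the budget.
    return {key: int(budget * value / total) for key, value in raw.items()}


def view_tag(age, cam):
    """Hypothetical natural-language tag for camera identity and temporal order."""
    return f"[{cam} camera, {age} step(s) ago]"


cameras = {"front": 2.0, "left": 1.0, "right": 1.0}
alloc = allocate_tokens(num_frames=2, cameras=cameras, budget=600)
for (age, cam), tokens in alloc.items():
    print(view_tag(age, cam), tokens)
```

With these toy settings the newest front-camera frame receives 200 tokens and the older side-camera frames 50 each, so changing `decay` or the weights reshapes the context without touching the model.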
The Agentic System
The interface makes RobotNav a building block for agentic systems. An upper-tier planner decomposes long-horizon goals into sub-goals. Qwen3.6-Plus serves as this planner in the system.
The planner reconfigures RobotNav's task mode mid-episode. RobotNav serves as the reactive executor. The two tiers communicate only through natural language.
A two-level memory supports long-horizon reasoning. Single-episode memory summarizes each rollout. Cross-episode memory accumulates durable conclusions such as which areas have already been searched.
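The planner/executor split and the two memory levels can be sketched with stubs. Nothing below calls a real model: `stub_planner` and `stub_executor` are hypothetical placeholders for the upper-tier planner and for RobotNav, and the room list and memory format are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Memory:
    episode: list = field(default_factory=list)  # single-episode rollout summaries
    durable: set = field(default_factory=set)    # cross-episode conclusions


def stub_planner(goal, memory):
    """Placeholder planner: emit (task_mode, instruction, room) sub-goals."""
    plan = []
    for room in ("kitchen", "bedroom"):
        if f"searched {room}" in memory.durable:
            continue  # skip areas an earlier episode already covered
        plan.append(("VLN", f"go to the {room}", room))
        plan.append(("ObjNav", f"search for the {goal}", room))
    return plan


def stub_executor(task_mode, instruction):
    """Placeholder executor: receives only a mode and text, returns a text report."""
    return f"[{task_mode}] completed: {instruction}"


def run_episode(goal, memory):
    memory.episode = []  # single-episode memory resets on each rollout
    for task_mode, instruction, room in stub_planner(goal, memory):
        memory.episode.append(stub_executor(task_mode, instruction))
        if task_mode == "ObjNav":
            memory.durable.add(f"searched {room}")
    return memory


memory = Memory()
run_episode("mug", memory)
first_reports = len(memory.episode)
run_episode("mug", memory)
print(first_reports, len(memory.episode), sorted(memory.durable))
# 4 0 ['searched bedroom', 'searched kitchen']
```

The second episode issues no navigation calls because the cross-episode memory already records both rooms as searched, which mirrors the role described for durable conclusions.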
Benchmark Results
RobotNav was trained on 15.6M samples. Navigation trajectory data forms 85% of this. Vision-language reasoning data fills the remaining 15%.
| Benchmark | Metric | Result |
|---|---|---|
| VLN-CE RxR (Val-Unseen) | Success Rate | 76.5% |
| VLN-CE R2R (Val-Unseen) | Success Rate | 72.1% |
| EVT-Bench | Tracking Rate | 90.0% |
| HM3Dv2 (ObjectNav) | Success Rate | 75.6% |
| NAVSIM | PDMS | 91.4 |
The agentic system sets a new state of the art on Embodied Question Answering. It improves over the best prior method by 10.8% on HM-EQA. It also improves by 15.4% on EXPRESS-Bench while requiring 77% fewer navigation steps.
The report shows performance improving from 2B to 8B parameters. Joint multi-task training develops a shared spatial-planning substrate. The report states that this transfers across task families.
Use Cases with Examples
Each model maps to concrete deployment scenarios. The examples below combine report-supported results with illustrative framing.
- RobotManip for few-shot deployment on new hardware: A team has a Franka arm and a handful of demonstrations. They fine-tune RobotManip on their own workspace. The report shows the pretrained prior helps more on clutter and unseen states than training from scratch.
- RobotManip for cross-embodiment skill transfer: A policy is jointly fine-tuned on 6K CobotMagic and 130 ARX demonstrations. It is then tested on four novel ARX tasks with zero target-task demonstrations. The report gives 55.0% success, over 4× the best ablated variant.
- RobotWorld as a synthetic data engine: A VLA policy needs more training data than physical collection allows. The research team lists synthetic data generation as one of three application directions. RobotWorld can generate video for new language instructions.
- RobotWorld as a policy evaluation environment: The report lists policy evaluation as a second application direction. A policy can be run against generated trajectories before it touches real hardware. This is presented as a direction, not a benchmarked result.
- RobotNav within an agentic system: An upper-tier planner decomposes a long-horizon goal into sub-goals. It dispatches navigation calls with different task modes and context settings. The research team's agentic system improves over the best prior EQA method by 10.8% on HM-EQA.
- RobotNav for autonomous driving: The same model handles point-goal driving as one task mode. It reaches 91.4 PDMS on NAVSIM. The forward camera receives the highest token weight by default.
Comparison Table: The Three Models
The table below consolidates the technical details. It is a reference for choosing the right model.
| Attribute | RobotManip | RobotWorld | RobotNav |
|---|---|---|---|
| Task type | Manipulation (VLA) | Video world model | Navigation |
| Backbone | Qwen3.5-4B | Frozen Qwen2.5-VL | Qwen3-VL |
| Action interface | Camera-frame EEF / joint | Natural language | Waypoint trajectories |
| Training data | ~38,100 hours | 8.6M video-text pairs | 15.6M samples |
| Key architecture | DiT flow-matching head | 60-layer double-stream MMDiT | MLP action head |
| Headline result | 1st on RoboChallenge Table30-v1 | 1st on EWMBench, DreamGen | 76.5% SR on VLN-CE RxR |
| Output | Continuous actions | Predicted video | 8 waypoints (x, y, θ) |
| Public repo | Yes (GitHub) | Blog only | Yes (GitHub) |
The three research reports do not present a combined system. Read together, they cover complementary layers. RobotWorld handles simulation and data generation, RobotManip handles manipulation, and RobotNav handles mobility.
Implementation Note: The Canonical Action Vector
The RobotManip action representation is worth understanding in code terms. It is the mechanism that lets different robots share one model. Below is a simplified illustration of the masking idea.
```python
# Conceptual sketch of RobotManip's 80-dim canonical vector.
# Two 29-dim per-arm blocks + 22 reserved dimensions = 80.
# This is illustrative, not the official implementation.
CANONICAL_DIM = 80

# Per-arm semantic groups, per the report:
ARM_GROUPS = {
    "joints": 7,     # joint positions
    "eef_pose": 9,   # 3D position + 6D rotation
    "gripper": 1,    # parallel gripper width
    "hand": 12,      # dexterous hand joints
}
ARM_BLOCK = sum(ARM_GROUPS.values())  # 29


def build_masked_action(populated_groups, arms):
    """Build the action vector and a per-dimension binary mask.

    populated_groups: set of group names this robot uses.
    arms: 1 for single-arm, 2 for dual-arm.
    Only populated dimensions carry supervision; the rest are masked.
    """
    action = [0.0] * CANONICAL_DIM
    mask = [0] * CANONICAL_DIM
    idx = 0
    for _ in range(arms):
        for group, size in ARM_GROUPS.items():
            if group in populated_groups:
                for d in range(idx, idx + size):
                    mask[d] = 1  # gradients flow only here
            idx += size
    # For a single arm, the second arm block and the reserved
    # dimensions are never visited, so they stay zero and masked.
    return action, mask


# A 7-DOF single-arm gripper fills joints, eef_pose, gripper of one arm.
_, mask = build_masked_action({"joints", "eef_pose", "gripper"}, arms=1)
print(sum(mask))  # -> 17 populated dims; the rest stay zero and masked
```
The per-dimension binary mask is the key idea. It ensures gradients flow only through semantically populated entries. This prevents spurious supervision on absent degrees of freedom.
The same masking principle appears in the flow-matching loss. Each sample contributes equally regardless of how many dimensions are active. This stops robots with more populated slots from dominating optimization.
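The per-sample normalization can be sketched as follows. A plain squared error stands in for the flow-matching objective, and dividing by the number of active dimensions is one assumed way to make each sample contribute equally; the report's exact loss is not reproduced here, and the function names are hypothetical.

```python
def masked_loss(pred, target, mask):
    """Squared error over active dimensions only, averaged per sample."""
    active = sum(mask)
    if active == 0:
        return 0.0  # nothing to supervise on this sample
    squared = sum(m * (p - t) ** 2 for p, t, m in zip(pred, target, mask))
    return squared / active


def batch_loss(batch):
    """Average the per-sample losses so every sample has equal weight."""
    per_sample = [masked_loss(p, t, m) for p, t, m in batch]
    return sum(per_sample) / len(per_sample)


# Same per-dimension error, very different numbers of active dimensions.
few = ([1.0, 0.0, 0.0, 0.0], [0.0] * 4, [1, 0, 0, 0])
many = ([1.0] * 4, [0.0] * 4, [1, 1, 1, 1])
print(masked_loss(*few), masked_loss(*many))  # 1.0 1.0
print(batch_loss([few, many]))                # 1.0
```

Without the division by `active`, the four-dimension sample would contribute four times as much as the one-dimension sample, which is the imbalance the normalization avoids.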
Key Takeaways
- Qwen released three embodied AI models: RobotManip, RobotWorld, and RobotNav (grouped as Qwen-RobotSuite).
- RobotManip aligns robot data into one 80-dimensional action vector and ranks 1st on RoboChallenge Table30-v1.
- RobotWorld uses natural language as the action interface and ranks 1st overall on EWMBench and DreamGen Bench.
- RobotNav exposes a controllable token-budget interface and hits 76.5% SR on VLN-CE RxR.
- Two of the three models ship with public GitHub repositories; RobotWorld is available only as a research paper.
