NVIDIA AI Introduces SpatialClaw: A Training-Free Agent That Treats Code as the Action Interface for Spatial Reasoning

NVIDIA Research has released SpatialClaw, a training-free framework for spatial reasoning. It targets a persistent weak point in vision-language models (VLMs). These models still struggle to judge where objects are, how they relate, and how they move in 3D.

SpatialClaw does not retrain the model. Instead, it changes the action interface the agent uses to call perception tools. The research team argues that the interface is the bottleneck, and their solution is to treat code as the action interface. Across 20 benchmarks, SpatialClaw reaches 59.9% average accuracy. It outperforms the recent spatial agent SpaceTools by 11.2 points.

What is SpatialClaw

SpatialClaw is an agent loop wrapped around a stateful Python kernel. The kernel is pre-loaded with input frames and a set of primitives. Perception tools are plain Python callables. Their outputs, including masks, depth maps, camera geometry, and trajectories, are ordinary Python variables.

The kernel exposes six public entry points. InputImages holds the sampled frames. Metadata carries frame rate, duration, and frame indices. tools exposes perception and geometry primitives. show() embeds an image into the agent's next context. vlm dispatches queries to a separate VLM session. ReturnAnswer() submits the final answer.
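To make the idea of a stateful kernel concrete, here is a minimal sketch. It is not the SpatialClaw implementation: it assumes a plain dict as the kernel namespace and uses Python's built-in exec() to run each cell. MiniKernel, run_cell, and the placeholder frame strings are hypothetical names chosen for illustration. InputImages, Metadata, show, and ReturnAnswer reuse the entry-point names from the paragraph above, but their bodies here are stand-ins.

```python
class MiniKernel:
    """Toy stateful kernel: every cell runs in the same namespace."""

    def __init__(self, frames, metadata):
        self.answer = None   # set once ReturnAnswer() is called
        self.shown = []      # images queued for the agent's next context
        # Pre-load the namespace with stand-ins for the named entry points.
        self.namespace = {
            "InputImages": frames,
            "Metadata": metadata,
            "show": self.shown.append,
            "ReturnAnswer": self._return_answer,
        }

    def _return_answer(self, value):
        self.answer = value

    def run_cell(self, source):
        # exec() mutates self.namespace, so variables persist across cells.
        exec(source, self.namespace)


# Hypothetical inputs: two placeholder frames and a frame rate.
kernel = MiniKernel(frames=["frame_0", "frame_1"], metadata={"fps": 30})
kernel.run_cell("n_frames = len(InputImages)")                # cell 1 defines a variable
kernel.run_cell("show(InputImages[0])")                       # cell 2 queues an image
kernel.run_cell("ReturnAnswer(n_frames * Metadata['fps'])")   # cell 3 reuses n_frames
print(kernel.answer)
```

The point of the sketch is the third cell: it reads n_frames, which was defined two cells earlier, because all cells share one namespace.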

Two perception tools are central. tools.Reconstruct wraps Depth Anything 3 and returns per-frame depth, camera intrinsics, extrinsics, and dense point maps. tools.SAM3 wraps SAM 3 and produces image or video masks from text, point, or box prompts. The framework adds lightweight utilities: tools.Geometry, tools.Mask, tools.Time, tools.Graph, and tools.Draw.
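Depth, intrinsics, and point maps are linked by the standard pinhole camera model. The sketch below shows that relationship for a single pixel. The article does not say how tools.Reconstruct builds its point maps internally, so treat this as textbook unprojection rather than the tool's code. The function name unproject and the intrinsics values are assumptions for illustration.

```python
def unproject(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with metric depth into camera coordinates."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


# Hypothetical intrinsics for a 640x480 image (focal lengths and principal point).
fx, fy, cx, cy = 500.0, 500.0, 320.0, 240.0

center = unproject(320, 240, 2.0, fx, fy, cx, cy)  # pixel at the principal point
right = unproject(420, 240, 2.0, fx, fy, cx, cy)   # 100 px to the right, same depth
print(center, right)
```

Applying the camera extrinsics to these camera-frame points would place them in a shared world frame, which is what makes distances across different frames comparable. That step is omitted here.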

It is training-free. The same system prompt, tool set, and hyperparameters run across every benchmark and backbone.

https://spatialclaw.github.io/static/pdfs/spatialclaw.pdf

Why the Action Interface Matters

The research team studied three action interfaces on the same question. Consider measuring the closest distance between a heater and a door.

Single-pass code writes one full program and runs it once. It commits to a full strategy before seeing any intermediate mask or depth map. A wrong assumption then propagates straight to the answer.
Structured tool-call invokes named tools through a fixed JSON schema. It cannot freely combine outputs with NumPy or SciPy to express test-time computations. The closest-point operation has no pre-registered tool, so the result is wrong.
SpatialClaw composes tools in code, inspects results, then revises. It first computes a centroid distance, then notices that the centroid uses a median. The agent switches to scipy.spatial.KDTree to find the true closest point. It submits 0.9439 m against a 0.9 m ground truth.
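The difference between a centroid distance and a closest-point distance can be sketched on toy data. The example below uses only the standard library. A brute-force minimum stands in for the scipy.spatial.KDTree query the agent uses; a KD-tree returns the same minimum, only faster on large point sets. The point clouds and the helper names median_centroid and closest_distance are hypothetical.

```python
import math
import statistics


def median_centroid(points):
    """Per-axis median of a list of (x, y, z) points."""
    return tuple(statistics.median(axis) for axis in zip(*points))


def closest_distance(points_a, points_b):
    """Smallest pairwise distance between two point sets (brute force)."""
    return min(math.dist(a, b) for a in points_a for b in points_b)


# Hypothetical point clouds: a heater and a door, both extended along x.
heater = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (1.0, 0.0, 0.0)]
door = [(2.0, 0.0, 0.0), (2.5, 0.0, 0.0), (3.0, 0.0, 0.0)]

centroid_gap = math.dist(median_centroid(heater), median_centroid(door))
surface_gap = closest_distance(heater, door)
print(centroid_gap, surface_gap)
```

For extended objects the centroid gap overstates the separation, since it measures middle to middle rather than nearest edge to nearest edge. That is the same kind of error the agent corrects in the walkthrough.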

Benchmark

SpatialClaw was tested on 20 benchmarks across five categories. These span single-image, multi-view, general, video and 4D, and general video understanding. It improves over the no-tool baseline on all six backbones tested. Backbones range from 26B to 397B parameters across the Qwen3.5/3.6 and Gemma4 families.

A controlled comparison isolates the interface. All three variants share the same toolset and prompt. Only the action interface differs.

Action interface	Avg. (20 bench.)	Δ vs no-tool
No-tool baseline	53.4	–
Single-pass code	55.2	+1.8
Structured tool-call	56.7	+3.3
SpatialClaw (code as action)	59.9	+6.5

Gemma4-31B backbone, 20-benchmark average.
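The delta column can be reproduced from the averages in the table. This short check uses only the numbers reported above; the variable names are illustrative.

```python
# 20-benchmark averages on the Gemma4-31B backbone, as reported in the table.
baseline = 53.4
scores = {
    "Single-pass code": 55.2,
    "Structured tool-call": 56.7,
    "SpatialClaw": 59.9,
}

# Gain of each interface over the no-tool baseline, rounded to one decimal.
deltas = {name: round(score - baseline, 1) for name, score in scores.items()}

# Gap attributable to the interface alone: code as action vs structured tool-call.
interface_gap = round(scores["SpatialClaw"] - scores["Structured tool-call"], 1)
print(deltas, interface_gap)
```

Because the toolset and prompt are held fixed, the last figure is the part of the gain that comes from the interface change alone.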

Against prior spatial agents on the same Gemma4-31B backbone, the gap widens.

Method	Interface	Avg.	Δ vs SpatialClaw
VADAR	Single-pass	40.5*	−19.4
pySpatial	Single-pass	47.8	−12.1
SpaceTools-Toolshed	Structured tool-call	48.7	−11.2
SpatialClaw	Code as action	59.9	best

*VADAR does not support video or multi-image inputs; only single-image benchmarks are averaged.

The largest gains land on dynamic tasks. On Gemma4-31B, DSI-Bench rose 17.6 points and MindCube rose 15.3 points. These categories need chained geometric computation across frames and viewpoints.

An LLM-as-judge attribution explains the wins over structured tool-call. Code composition accounts for 52.2% of them. Control flow accounts for 19.5%, and the remaining 28.3% are interface-neutral.

Inside the Five-Stage Loop

Each sample runs a five-stage loop: planning, code generation, code execution, feedback assembly, and answer submission. A planner drafts a strategy without seeing the images. The main agent then writes one Python cell per step. A static AST checker rejects unsafe code before execution. The loop repeats until ReturnAnswer() is called or 30 steps pass.
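The check-then-execute part of the loop can be sketched with Python's ast module. This is an illustration under stated assumptions, not the repo's checker. The article does not list which constructs count as unsafe, so the policy here (no imports, no calls to a few banned names) is invented for the example. A fixed list of cells stands in for the cells the agent would generate, and is_safe, run_loop, and BANNED_CALLS are hypothetical names.

```python
import ast

BANNED_CALLS = {"eval", "exec", "open", "__import__"}  # assumption: illustrative policy
MAX_STEPS = 30  # step budget stated in the article


def is_safe(source):
    """Reject cells that fail to parse, import modules, or call banned names."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            return False
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id in BANNED_CALLS):
            return False
    return True


def run_loop(cells):
    """Run cells in one shared namespace until an answer is submitted."""
    result = {}
    namespace = {"ReturnAnswer": lambda value: result.setdefault("answer", value)}
    log = []
    for step, source in enumerate(cells[:MAX_STEPS], start=1):
        if not is_safe(source):
            log.append((step, "rejected"))  # checked statically, never executed
            continue
        exec(source, namespace)
        log.append((step, "ok"))
        if "answer" in result:
            break  # ReturnAnswer() ends the loop
    return result.get("answer"), log


# Hypothetical cells standing in for agent-generated code.
cells = [
    "dist = 1.4807",
    "import os",           # rejected by the checker under this toy policy
    "dist = 0.9439",       # revision overwrites the earlier variable
    "ReturnAnswer(dist)",
    "never_runs = True",   # unreachable: the loop has already stopped
]
answer, log = run_loop(cells)
print(answer, log)
```

A real checker would need a more careful policy, since an agent that relies on NumPy or SciPy has to reach those modules somehow. The sketch only shows where the static check sits relative to execution and the step budget.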

The official repo runs on a LangGraph workflow and a persistent Jupyter kernel. Backbones are served through vLLM. Perception runs behind a FastAPI GPU service. A single quickstart runs one benchmark on one machine:

git clone --recursive https://github.com/NVlabs/SpatialClaw.git
cd SpatialClaw
bash spatial_agent/scripts/setup.sh
cp .env.example .env        # add API keys, or self-host vLLM
python -m spatial_agent.entrypoints.run \
    --dataset spatial_agent/config/dataset/erqa.json \
    --model   spatial_agent/config/model/gemini-3-pro.json \
    --concurrency 4

A representative agent cell composes perception with geometry, then revises:

# Reconstruct the scene, then segment both objects in a single video pass
recon = tools.Reconstruct.Reconstruct(InputImages)
seg = tools.SAM3.segment_video_by_text(["radiator heater", "door"])
show(seg.visualize(1))                         # inspect the masks first

# Closest-point distance via KD-tree, not centroids
# (scipy is assumed to be available in the kernel namespace)
pts_h = seg.get_masked_points(recon, frame=1, object=0)   # object 0 = heater
pts_d = seg.get_masked_points(recon, frame=2, object=1)   # object 1 = door
dists, _ = scipy.spatial.KDTree(pts_d).query(pts_h, k=1)
ReturnAnswer(float(dists.min()))

The agent picks primitives from the question itself. Distance questions invoke KD-tree search and vector norms. Direction questions rely on dot products. No category-specific routing was applied.
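As a rough illustration of how a dot product answers a direction question, the sketch below classifies a target as in front of or behind a viewpoint and measures the angle between two directions. It uses only the standard library. The helper names, the camera pose, and the door and sink coordinates are hypothetical and do not come from the SpatialClaw toolset.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(dot(v, v))


def relative_side(camera_pos, forward, target):
    """Classify a target as in front of or behind a viewpoint."""
    offset = tuple(t - c for t, c in zip(target, camera_pos))
    # A positive projection onto the forward axis means the target is ahead.
    return "front" if dot(offset, forward) > 0 else "behind"


def angle_deg(a, b):
    """Angle between two non-zero vectors, in degrees."""
    cos = dot(a, b) / (norm(a) * norm(b))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))  # clamp for safety


# Hypothetical scene: camera at the origin looking along +z.
camera, forward = (0.0, 0.0, 0.0), (0.0, 0.0, 1.0)
door, sink = (1.0, 0.0, 2.0), (0.0, 0.0, -3.0)

door_side = relative_side(camera, forward, door)
sink_side = relative_side(camera, forward, sink)
turn = angle_deg((1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
print(door_side, sink_side, turn)
```

The sign of the dot product carries the direction, and the vector norm carries the distance, so the two kinds of question share the same small set of primitives.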

Use Cases

The design fits problems that need step-by-step geometric reasoning. Concrete examples include:

Robotics and embodied agents that measure metric distances between objects before acting.
Multi-view inspection, where an object's facing direction is recovered from multiple camera angles.
Video and 4D analysis that tracks object or camera motion across frames.
Indoor scene question answering, such as "where is the door relative to the sink?"

Because it is training-free, teams can extend a deployed VLM without new data or fine-tuning.

Interactive Explainer

The original page embeds an interactive walkthrough of the heater-and-door example. It is faithful to Figure 2 of the SpatialClaw paper, although the interface logic is illustrative (built for Marktechpost, verified Jun 2026). Its three traces are summarized here:

Single-pass code (no persistence): one full program is committed before any execution feedback is seen. The validity of the masks is never checked, and the run submits 1.638, which is wrong.
Structured tool-call (named results only): each step binds one named result, result_1 through result_4, and these cannot be freely composed with NumPy or SciPy at test time. The agent reconstructs the scene, segments the heater and the door, and calls the predefined compute_dist tool on the two centroids. No registered tool returns the closest point, so it submits the centroid distance of 6.5, which is wrong.
SpatialClaw (persistent Python kernel): every object stays a live Python variable. The agent segments both objects, reconstructs the scene, and renders the masks with show() for inspection. It computes a centroid distance of 1.4807, notices that tools.get_centroid uses the median rather than the closest point, switches to scipy.spatial.KDTree, and submits 0.9439 m, which is correct.

Key Takeaways

Code as the action interface: SpatialClaw lets a VLM write one Python cell per step into a persistent kernel, composing and revising perception outputs instead of committing to a fixed plan.
State of the art, training-free: 59.9% average across 20 spatial benchmarks, +11.2 points over the prior agent SpaceTools, with no benchmark- or model-specific tuning.
The interface is the lever: swapping only the action interface on Gemma4-31B moves accuracy from 56.7 (structured tool-call) to 59.9, and 52.2% of wins trace to code composition.
Biggest gains where geometry chains: dynamic 4D and multi-view tasks lead the lifts (DSI-Bench +17.6, MindCube +15.3), where steps must compose across frames and viewpoints.
Perception is the ceiling: gains transfer across six backbones (26B–397B), but the remaining bottleneck is perception quality, and the license is non-commercial.

The post NVIDIA AI Introduces SpatialClaw: A Training-Free Agent That Treats Code as the Action Interface for Spatial Reasoning appeared first on MarkTechPost.
