NVIDIA AI Introduces SpatialClaw: A Training-Free Agent That Treats Code as the Action Interface for Spatial Reasoning
NVIDIA Research has released SpatialClaw, a training-free framework for spatial reasoning. It targets a persistent weak point in vision-language models (VLMs). These models still struggle to judge where objects are, how they relate, and how they move in 3D.
SpatialClaw does not retrain the model. Instead, it changes the action interface the agent uses to call perception tools. The research team argues that the interface is the bottleneck, and their solution is to treat code as the action interface. Across 20 benchmarks, SpatialClaw reaches 59.9% average accuracy, outperforming the recent spatial agent SpaceTools by 11.2 points.
What is SpatialClaw
SpatialClaw is an agent loop wrapped around a stateful Python kernel. The kernel is pre-loaded with input frames and a set of primitives. Perception tools are plain Python callables. Their outputs, including masks, depth maps, camera geometry, and trajectories, are ordinary Python variables.
The kernel exposes six public entry points. InputImages holds the sampled frames. Metadata carries frame rate, duration, and frame indices. tools exposes perception and geometry primitives. show() embeds an image into the agent's next context. vlm dispatches queries to a separate VLM session. ReturnAnswer() submits the final answer.
Two perception tools are central. tools.Reconstruct wraps Depth Anything 3 and returns per-frame depth, camera intrinsics, extrinsics, and dense point maps. tools.SAM3 wraps SAM 3 and produces image or video masks from text, point, or box prompts. The framework also adds lightweight utilities: tools.Geometry, tools.Mask, tools.Time, tools.Graph, and tools.Draw.
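To make the relationship between per-frame depth, camera intrinsics, and a dense point map concrete, here is a minimal sketch of standard pinhole back-projection. It is an illustration only: the function names, the toy depth map, and the intrinsics values are assumptions, and the article does not say how Depth Anything 3 or tools.Reconstruct actually computes its point maps.

```python
def unproject(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with metric depth into camera coordinates
    using the pinhole model (assumed here, not taken from the article)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def depth_to_points(depth_map, fx, fy, cx, cy):
    """Turn an H x W depth map (list of rows) into an H x W grid of 3D points."""
    return [
        [unproject(u, v, d, fx, fy, cx, cy) for u, d in enumerate(row)]
        for v, row in enumerate(depth_map)
    ]


# Hypothetical 2 x 3 depth map in metres and made-up intrinsics.
depth_map = [
    [2.0, 2.0, 2.0],
    [4.0, 4.0, 4.0],
]
points = depth_to_points(depth_map, fx=100.0, fy=100.0, cx=1.0, cy=0.5)

# The pixel in the principal column keeps x = 0 and carries its depth as z.
print(points[0][1])
```

Once every pixel has a 3D point, a segmentation mask selects the points that belong to one object, which is what makes metric distance questions answerable in code.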
It is training-free. The same system prompt, tool set, and hyperparameters run across every benchmark and backbone.

Why the Action Interface Matters
The research team studied three action interfaces on the same question. Consider measuring the closest distance between a heater and a door.
- Single-pass code writes one full program and runs it once. It commits to a full strategy before seeing any intermediate mask or depth map. A wrong assumption then propagates straight to the answer.
- Structured tool-call invokes named tools through a fixed JSON schema. It cannot freely combine outputs with NumPy or SciPy to express test-time computations. The closest-point operation has no pre-registered tool, so the result is wrong.
- SpatialClaw composes tools in code, inspects results, then revises. It first computes a centroid distance, then notices that the centroid uses a median, which is not the closest point. The agent switches to scipy.spatial.KDTree to find the true closest point. It submits 0.9439 m against a 0.9 m ground truth.
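The gap between a centroid distance and a closest-point distance can be shown with a small sketch. The point clouds below are invented for illustration, and the closest-point search is a plain brute-force loop standing in for scipy.spatial.KDTree so the example runs on the standard library alone. On real point clouds with thousands of points, a KD-tree query should return the same minimum far faster.

```python
import math
from statistics import median


def median_centroid(points):
    """Per-axis median of a list of (x, y, z) points."""
    return tuple(median(axis) for axis in zip(*points))


def closest_point_distance(points_a, points_b):
    """Smallest Euclidean distance between any point in A and any point in B.

    Brute force, O(len(A) * len(B)); a stand-in for a KD-tree query.
    """
    return min(math.dist(a, b) for a in points_a for b in points_b)


# Hypothetical point clouds: a heater along one wall and a wide door.
heater = [(0.0, y / 10, 0.0) for y in range(11)]       # x = 0, y in [0, 1]
door = [(1.0 + x / 10, 0.5, 0.0) for x in range(21)]   # x in [1, 3], y = 0.5

centroid_gap = math.dist(median_centroid(heater), median_centroid(door))
closest_gap = closest_point_distance(heater, door)

print(f"centroid distance: {centroid_gap:.2f} m")
print(f"closest distance:  {closest_gap:.2f} m")
```

In this toy scene the centroid distance is twice the closest-point distance, because the door is wide and its centre sits far from its nearest edge. That is the kind of discrepancy the agent catches when it inspects an intermediate result before answering.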
Benchmark
SpatialClaw was tested on 20 benchmarks across 5 categories. These span single-image, multi-view, general, video and 4D, and general video understanding. It improves over the no-tool baseline on all six backbones tested. Backbones range from 26B to 397B parameters across the Qwen3.5/3.6 and Gemma4 families.
A controlled comparison isolates the interface. All three variants share the same toolset and prompt. Only the action interface differs.
| Action interface | Avg. (20 bench.) | Δ vs no-tool |
|---|---|---|
| No-tool baseline | 53.4 | – |
| Single-pass code | 55.2 | +1.8 |
| Structured tool-call | 56.7 | +3.3 |
| SpatialClaw (code as action) | 59.9 | +6.5 |
Gemma4-31B backbone, 20-benchmark average.
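The deltas in the table follow directly from the averages. The short check below recomputes them from the reported scores. The variable names are illustrative, and the 3.2-point figure for code over structured tool-call is derived here from the table rather than quoted from the paper.

```python
# Reported 20-benchmark averages on the Gemma4-31B backbone.
baseline = 53.4
scores = {
    "Single-pass code": 55.2,
    "Structured tool-call": 56.7,
    "SpatialClaw (code as action)": 59.9,
}

# Gain of each interface over the no-tool baseline, rounded to one decimal.
deltas = {name: round(score - baseline, 1) for name, score in scores.items()}
for name, delta in deltas.items():
    print(f"{name}: +{delta}")

# Gain attributable to swapping structured tool-call for code as action.
interface_gain = round(
    scores["SpatialClaw (code as action)"] - scores["Structured tool-call"], 1
)
print(f"Code as action vs structured tool-call: +{interface_gain}")
```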
Against prior spatial agents on the same Gemma4-31B backbone, the gap widens.
| Method | Interface | Avg. | Δ vs SpatialClaw |
|---|---|---|---|
| VADAR | Single-pass | 40.5* | −19.4 |
| pySpatial | Single-pass | 47.8 | −12.1 |
| SpaceTools-Toolshed | Structured tool-call | 48.7 | −11.2 |
| SpatialClaw | Code as action | 59.9 | best |
The largest gains land on dynamic tasks. On Gemma4-31B, DSI-Bench rose 17.6 points and MindCube rose 15.3 points. These categories need chained geometric computation across frames and viewpoints.
An LLM-as-judge attribution explains the wins over structured tool-call. Code composition accounts for 52.2% of them. Control flow accounts for 19.5%, and the remaining 28.3% are interface-neutral.
Inside the Five-Stage Loop
Each sample runs a five-stage loop: planning, code generation, code execution, feedback assembly, and answer submission. A planner drafts a strategy without seeing the images. The main agent then writes one Python cell per step. A static AST checker rejects unsafe code before execution. The loop repeats until ReturnAnswer() is called or 30 steps pass.
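The execution side of this loop can be sketched in a few dozen lines. What follows is a rough illustration, not the project's implementation: the deny-lists, the AnswerSubmitted exception, and the run_loop helper are all assumptions, the cells are pre-scripted rather than generated by a model, and the real system uses a Jupyter kernel rather than exec. Only the 30-step cap, the static AST check, and the ReturnAnswer() name come from the description above.

```python
import ast

BANNED_MODULES = {"os", "subprocess", "shutil", "socket"}  # assumed deny-list
BANNED_CALLS = {"eval", "exec", "open", "__import__"}      # assumed deny-list
MAX_STEPS = 30  # step cap stated in the article


class AnswerSubmitted(Exception):
    """Raised by ReturnAnswer to end the loop (hypothetical mechanism)."""

    def __init__(self, value):
        super().__init__(value)
        self.value = value


def return_answer(value):
    raise AnswerSubmitted(value)


def is_safe(source):
    """Static check: parse the cell and reject banned imports and calls."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            if any(a.name.split(".")[0] in BANNED_MODULES for a in node.names):
                return False
        elif isinstance(node, ast.ImportFrom):
            if (node.module or "").split(".")[0] in BANNED_MODULES:
                return False
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in BANNED_CALLS:
                return False
    return True


def run_loop(cells):
    """Run cells in one shared namespace until ReturnAnswer or the step cap."""
    namespace = {"ReturnAnswer": return_answer}
    log = []
    for step, source in enumerate(cells[:MAX_STEPS], start=1):
        if not is_safe(source):
            log.append((step, "rejected"))
            continue
        try:
            exec(source, namespace)  # variables persist across cells
            log.append((step, "ok"))
        except AnswerSubmitted as done:
            log.append((step, "answered"))
            return done.value, log
        except Exception as err:
            log.append((step, f"error: {type(err).__name__}"))
    return None, log


# Pre-scripted cells standing in for model-generated ones.
cells = [
    "dist = 1.4807",               # first attempt
    "import os\nos.remove('x')",   # unsafe cell, rejected before execution
    "dist = 0.9439",               # revision reuses the same variable name
    "ReturnAnswer(dist)",
]
answer, log = run_loop(cells)
print(answer, log)
```

The shared namespace is what lets a later cell revise a value from an earlier one, and the static check runs before any code executes, so a rejected cell leaves the state untouched.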
The official repo runs on a LangGraph workflow and a persistent Jupyter kernel. Backbones are served through vLLM. Perception runs behind a FastAPI GPU service. A single quickstart runs one benchmark on one machine:
git clone --recursive https://github.com/NVlabs/SpatialClaw.git
cd SpatialClaw
bash spatial_agent/scripts/setup.sh
cp .env.example .env # add API keys, or self-host vLLM
python -m spatial_agent.entrypoints.run \
  --dataset spatial_agent/config/dataset/erqa.json \
  --model spatial_agent/config/model/gemini-3-pro.json \
  --concurrency 4
A representative agent cell composes perception with geometry, then revises:
# Reconstruct the scene, then segment both objects in a single video pass
recon = tools.Reconstruct.Reconstruct(InputImages)
seg = tools.SAM3.segment_video_by_text(["radiator heater", "door"])
show(seg.visualize(1)) # inspect the masks first
# Closest-point distance via KD-tree, not centroids
pts_h = seg.get_masked_points(recon, frame=1, object=0) # object 0 = heater
pts_d = seg.get_masked_points(recon, frame=2, object=1) # object 1 = door
dists, _ = scipy.spatial.KDTree(pts_d).query(pts_h, k=1)
ReturnAnswer(float(dists.min()))
The agent picks primitives from the question itself. Distance questions invoke KD-tree search and vector norms. Direction questions rely on dot products. No category-specific routing is applied.
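As a sketch of how dot products answer a direction question, the example below classifies a target as in front of or behind, and left or right of, an observer. The coordinates, the axis convention, and the function names are assumptions made for illustration and are not part of the SpatialClaw toolset.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def relative_direction(observer, forward, right, target):
    """Classify the target relative to an observer.

    forward and right are assumed to be unit vectors in the same frame as
    the positions. The sign of each dot product picks the half-space.
    """
    offset = sub(target, observer)
    depth = "front" if dot(offset, forward) >= 0 else "behind"
    side = "right" if dot(offset, right) >= 0 else "left"
    return depth, side


# Hypothetical scene: an observer at the sink looking along +z, with +x to the right.
sink = (0.0, 0.0, 0.0)
door = (-2.0, 0.0, 3.0)
forward = (0.0, 0.0, 1.0)
right = (1.0, 0.0, 0.0)

direction = relative_direction(sink, forward, right, door)
distance = norm(sub(door, sink))  # vector norm answers the distance part
print(direction, round(distance, 3))
```

The same offset vector serves both question types: its norm gives the distance, and its projections onto the observer's axes give the direction.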
Use Cases
The design fits problems that need step-by-step geometric reasoning. Concrete examples include:
- Robotics and embodied agents that measure metric distances between objects before acting.
- Multi-view inspection, where an object's facing direction is recovered from multiple camera angles.
- Video and 4D analysis that tracks object or camera motion across frames.
- Indoor scene question answering, such as "where is the door relative to the sink?"
Because it is training-free, teams can extend a deployed VLM without new data or fine-tuning.
Interactive Explainer
[Interactive widget not reproduced. It steps through the heater-to-door example from Figure 2 of the SpatialClaw paper under each interface. Single-pass code submits 1.638 without ever checking the masks (incorrect). Structured tool-call submits the centroid distance of 6.5 because the schema has no closest-point tool (incorrect). SpatialClaw first computes a centroid distance of 1.4807, revises it with scipy.spatial.KDTree, and submits 0.9439 m (correct). The widget notes that it is faithful to the paper's walkthrough while the interface logic is illustrative. Built for Marktechpost, verified Jun 2026.]
Key Takeaways
- Code as the action interface: SpatialClaw lets a VLM write one Python cell per step into a persistent kernel, composing and revising perception outputs instead of committing to a fixed plan.
- State of the art, training-free: 59.9% average across 20 spatial benchmarks, 11.2 points above the prior agent SpaceTools, with no benchmark- or model-specific tuning.
- The interface is the lever: swapping only the action interface on Gemma4-31B moves accuracy from 56.7 (structured tool-call) to 59.9, and 52.2% of wins trace to code composition.
- Biggest gains where geometry chains: dynamic 4D and multi-view tasks lead the lifts (DSI-Bench +17.6, MindCube +15.3), where steps must compose across frames and viewpoints.
- Perception is the ceiling: gains transfer across six backbones (26B–397B), but the remaining bottleneck is perception quality, and the license is non-commercial.
The post NVIDIA AI Introduces SpatialClaw: A Training-Free Agent That Treats Code as the Action Interface for Spatial Reasoning appeared first on MarkTechPost.
