A Coding Implementation of MolmoAct for Depth-Aware Spatial Reasoning, Visual Trajectory Tracing, and Robotic Action Prediction
In this tutorial, we walk through MolmoAct step by step and build a practical understanding of how action-reasoning models can reason in space from visual observations. We set up the environment, load the model, prepare multi-view image inputs, and explore how MolmoAct produces depth-aware reasoning, visual traces, and actionable robot outputs from natural-language instructions. As we move through the workflow, we run inference and also examine how the model parses actions, visualizes trajectories, and supports more advanced processing pipelines for robotics-oriented tasks.
print("=" * 80)
print("SECTION 1: INSTALLATION AND SETUP")
print("=" * 80)

import subprocess
import sys

def install_packages():
    """Install all required packages for MolmoAct"""
    packages = [
        "torch>=2.0.0",
        "torchvision",
        "transformers==4.52",
        "accelerate",
        "einops",
        "Pillow",
        "numpy",
        "matplotlib",
        "requests",
        "scipy",
        "huggingface_hub",
    ]
    for package in packages:
        print(f"Installing {package}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", package])
    print("All packages installed successfully!")

install_packages()
print("\n" + "=" * 80)
print("SECTION 2: IMPORTS AND CONFIGURATION")
print("=" * 80)

import torch
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
import requests
from io import BytesIO
from typing import List, Tuple, Dict, Optional, Union
import json
import time
from dataclasses import dataclass
import warnings
import re

warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=UserWarning)

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Device: {device}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
print("\n" + "=" * 80)
print("SECTION 3: MOLMOACT MODEL LOADER")
print("=" * 80)

@dataclass
class MolmoActConfig:
    """Configuration for the MolmoAct model"""
    model_name: str = "allenai/MolmoAct-7B-D-0812"
    torch_dtype: str = "bfloat16"
    device_map: str = "auto"
    max_new_tokens: int = 256
    temperature: float = 0.0
    do_sample: bool = False
We set up the tutorial and prepare the environment needed to run MolmoAct in Google Colab. We install all required packages, import the core libraries, and configure the runtime to detect whether GPU acceleration is available. We also define the base configuration class that stores the main model settings we use throughout the rest of the tutorial.
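As a quick, dependency-free illustration of this configuration pattern, the sketch below uses a stand-in dataclass (named `DemoConfig` here to avoid clashing with the tutorial's `MolmoActConfig`; the fields mirror it) to show how defaults and per-field overrides behave:

```python
from dataclasses import dataclass, asdict

# Stand-in mirroring the tutorial's MolmoActConfig (no torch needed here)
@dataclass
class DemoConfig:
    model_name: str = "allenai/MolmoAct-7B-D-0812"
    torch_dtype: str = "bfloat16"
    device_map: str = "auto"
    max_new_tokens: int = 256
    temperature: float = 0.0
    do_sample: bool = False

# Defaults apply unless a field is explicitly overridden
default_cfg = DemoConfig()
fast_cfg = DemoConfig(max_new_tokens=64, do_sample=True)

print(asdict(fast_cfg))
```

Because the dtype is stored as a string, the loader can later resolve it with `getattr(torch, config.torch_dtype)` without hard-coding a torch dependency into the config itself.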
class MolmoActModel:
    """
    MolmoAct model wrapper for easy inference.

    This class provides a high-level interface for:
    - Loading and managing the model
    - Running inference with correct prompting
    - Parsing outputs (depth, trace, actions)
    - Batch processing
    """
    def __init__(self, config: Optional[MolmoActConfig] = None):
        self.config = config or MolmoActConfig()
        self.model = None
        self.processor = None
        self._loaded = False

    def load(self) -> None:
        """Load the MolmoAct model and processor"""
        if self._loaded:
            print("Model already loaded!")
            return
        print(f"Loading MolmoAct model: {self.config.model_name}")
        print("   This can take a few minutes on the first run...")
        from transformers import AutoModelForImageTextToText, AutoProcessor
        dtype = getattr(torch, self.config.torch_dtype)
        print("Loading model weights...")
        self.model = AutoModelForImageTextToText.from_pretrained(
            self.config.model_name,
            trust_remote_code=True,
            torch_dtype=dtype,
            device_map=self.config.device_map,
        )
        print("Loading processor...")
        try:
            self.processor = AutoProcessor.from_pretrained(
                self.config.model_name,
                trust_remote_code=True,
            )
            if hasattr(self.processor, 'tokenizer'):
                self.processor.tokenizer.padding_side = "left"
        except TypeError as e:
            if "prompt_templates" in str(e):
                print("Handling custom processor configuration...")
                from transformers.dynamic_module_utils import get_class_from_dynamic_module
                processor_class = get_class_from_dynamic_module(
                    "processing_molmoact.MolmoActProcessor",
                    self.config.model_name,
                    trust_remote_code=True,
                )
                from transformers import AutoTokenizer, AutoImageProcessor
                tokenizer = AutoTokenizer.from_pretrained(
                    self.config.model_name,
                    trust_remote_code=True,
                    padding_side="left",
                )
                image_processor = AutoImageProcessor.from_pretrained(
                    self.config.model_name,
                    trust_remote_code=True,
                )
                self.processor = processor_class(
                    image_processor=image_processor,
                    tokenizer=tokenizer,
                )
            else:
                raise e
        self._loaded = True
        print("Model loaded successfully!")
        self._print_model_info()

    def _print_model_info(self) -> None:
        """Print model information"""
        total_params = sum(p.numel() for p in self.model.parameters())
        trainable_params = sum(p.numel() for p in self.model.parameters() if p.requires_grad)
        print(f"\nModel Statistics:")
        print(f"   Total Parameters: {total_params / 1e9:.2f}B")
        print(f"   Trainable Parameters: {trainable_params / 1e9:.2f}B")
        print(f"   Model dtype: {next(self.model.parameters()).dtype}")

    def build_prompt(self, instruction: str) -> str:
        """
        Build the reasoning prompt for MolmoAct.

        The prompt structure is essential for MolmoAct to generate:
        1. Depth perception tokens
        2. A visual trajectory trace
        3. Action predictions
        """
        prompt = (
            f"The task is {instruction}. "
            "What is the action that the robot should take. "
            f"To figure out the action that the robot should take to {instruction}, "
            "let's think through it step by step. "
            "First, what is the depth map for the first image? "
            "Second, what is the trajectory of the end effector in the first image? "
            "Based on the depth map of the first image and the trajectory of the end effector in the first image, "
            "along with other images from different camera views as additional information, "
            "what is the action that the robot should take?"
        )
        return prompt
We begin building the main MolmoAct model wrapper that makes inference easier to manage. We load the model and processor, handle custom processor initialization logic, and print useful model statistics once loading is complete. We also define a prompt-building method that helps us structure the reasoning query, guiding the model toward depth, trace, and action generation.
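To see what the prompt template produces without loading any weights, here is a standalone sketch of the same prompt-building logic (the wording mirrors the wrapper's method; only the free-standing function is new):

```python
def build_prompt(instruction: str) -> str:
    # Mirrors the wrapper's prompt: depth map -> end-effector trace -> action
    return (
        f"The task is {instruction}. "
        "What is the action that the robot should take. "
        f"To figure out the action that the robot should take to {instruction}, "
        "let's think through it step by step. "
        "First, what is the depth map for the first image? "
        "Second, what is the trajectory of the end effector in the first image? "
        "Based on the depth map of the first image and the trajectory of the end "
        "effector in the first image, along with other images from different camera "
        "views as additional information, what is the action that the robot should take?"
    )

prompt = build_prompt("close the box")
print(prompt)
```

Note that the instruction is interpolated twice, once when stating the task and once when framing the step-by-step question, so keeping instructions short and unambiguous pays off directly in the prompt.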
    @torch.inference_mode()
    def generate(
        self,
        images: List[Image.Image],
        instruction: str,
        max_new_tokens: Optional[int] = None,
    ) -> Dict:
        """
        Generate action reasoning from images and an instruction.

        Args:
            images: List of PIL Images
            instruction: Task instruction
            max_new_tokens: Override the default max tokens

        Returns:
            Dictionary containing:
            - text: Generated reasoning text
            - depth: Parsed depth tokens
            - trace: Parsed visual trace coordinates
            - action: Parsed action values
        """
        if not self._loaded:
            raise RuntimeError("Model not loaded! Call .load() first.")
        prompt = self.build_prompt(instruction)
        max_tokens = max_new_tokens or self.config.max_new_tokens
        text = self.processor.apply_chat_template(
            [{"role": "user", "content": [dict(type="text", text=prompt)]}],
            tokenize=False,
            add_generation_prompt=True,
        )
        inputs = self.processor(
            images=[images],
            text=text,
            padding=True,
            return_tensors="pt",
        )
        inputs = {k: v.to(self.model.device) for k, v in inputs.items()}
        with torch.autocast("cuda", enabled=True, dtype=torch.bfloat16):
            generated_ids = self.model.generate(
                **inputs,
                max_new_tokens=max_tokens,
                do_sample=self.config.do_sample,
            )
        generated_tokens = generated_ids[:, inputs['input_ids'].size(1):]
        generated_text = self.processor.batch_decode(
            generated_tokens,
            skip_special_tokens=True,
            clean_up_tokenization_spaces=False
        )[0]
        result = {
            "text": generated_text,
            "depth": self._safe_parse_depth(generated_text),
            "trace": self._safe_parse_trace(generated_text),
            "action": self._safe_parse_action(generated_text, unnorm_key="molmoact"),
            "action_raw": self._safe_parse_action(generated_text, unnorm_key=None),
        }
        return result

    def _safe_parse_depth(self, text: str) -> List[str]:
        """Safely parse depth tokens from the generated text"""
        try:
            if hasattr(self.model, 'parse_depth'):
                return self.model.parse_depth(text)
        except Exception:
            pass
        depth_pattern = r'<DEPTH_START>.*?<DEPTH_END>'
        matches = re.findall(depth_pattern, text, re.DOTALL)
        return matches if matches else []

    def _safe_parse_trace(self, text: str) -> List[List[List[int]]]:
        """Safely parse visual trace coordinates from the generated text"""
        try:
            if hasattr(self.model, 'parse_trace'):
                return self.model.parse_trace(text)
        except Exception:
            pass
        coord_pattern = r'\[(\d+),\s*(\d+)\]|\((\d+),\s*(\d+)\)'
        matches = re.findall(coord_pattern, text)
        traces = []
        current_trace = []
        for match in matches:
            x = int(match[0] or match[2])
            y = int(match[1] or match[3])
            if 0 <= x <= 256 and 0 <= y <= 256:
                current_trace.append([x, y])
        if current_trace:
            traces.append(current_trace)
        return traces

    def _safe_parse_action(self, text: str, unnorm_key: Optional[str] = None) -> List[List[float]]:
        """Safely parse action values from the generated text"""
        try:
            if hasattr(self.model, 'parse_action'):
                return self.model.parse_action(text, unnorm_key=unnorm_key)
        except Exception:
            pass
        float_pattern = r'[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?'
        all_floats = re.findall(float_pattern, text)
        actions = []
        floats = [float(f) for f in all_floats]
        for i in range(len(floats) - 6):
            potential_action = floats[i:i+7]
            if all(-5 < v < 5 for v in potential_action[:6]):
                actions.append(potential_action)
                break
        return actions

    def batch_generate(
        self,
        batch_data: List[Tuple[List[Image.Image], str]],
        progress: bool = True
    ) -> List[Dict]:
        """Process multiple observations in a batch"""
        results = []
        total = len(batch_data)
        for i, (images, instruction) in enumerate(batch_data):
            if progress:
                print(f"\rProcessing {i+1}/{total}...", end="", flush=True)
            result = self.generate(images, instruction)
            results.append(result)
        if progress:
            print(f"\rProcessed {total} observations!")
        return results
print("\n" + "=" * 80)
print("SECTION 4: VISUALIZATION UTILITIES")
print("=" * 80)
We implement the core generation pipeline that takes images and an instruction and produces structured reasoning outputs. We process the inputs, run inference, decode the generated response, and extract depth, trace, and action information from the model output. We also add safe parsing methods and batch-processing support, enabling us to handle multiple observations more reliably and efficiently.
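The regex fallback used by the trace parser can be exercised on a synthetic string without running the model. In this sketch the sample text is invented for illustration; real outputs come from MolmoAct itself:

```python
import re

# Invented sample resembling a generated trace (real traces come from the model)
sample = "The trajectory of the end effector is [[120, 88], [134, 92], [150, 101]]."

# Same pattern as the fallback: bracketed or parenthesized integer pairs
coord_pattern = r'\[(\d+),\s*(\d+)\]|\((\d+),\s*(\d+)\)'
points = []
for m in re.findall(coord_pattern, sample):
    # Each match tuple has four groups; only one pair is non-empty
    x = int(m[0] or m[2])
    y = int(m[1] or m[3])
    if 0 <= x <= 256 and 0 <= y <= 256:  # traces live on a 256x256 grid
        points.append([x, y])

print(points)  # -> [[120, 88], [134, 92], [150, 101]]
```

The alternation in the pattern is why each `findall` tuple has four groups: the first two capture `[x, y]` style pairs, the last two capture `(x, y)` style pairs, and the `or` picks whichever pair actually matched.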
class MolmoActVisualizer:
    """Visualization utilities for MolmoAct outputs"""
    def __init__(self, figsize: Tuple[int, int] = (12, 8)):
        self.figsize = figsize
        self.colors = plt.cm.viridis(np.linspace(0, 1, 10))

    def plot_trace(
        self,
        image: Image.Image,
        trace: List[List[int]],
        title: str = "Visual Reasoning Trace",
        save_path: Optional[str] = None
    ) -> None:
        """Plot the visual trace overlaid on the image"""
        fig, ax = plt.subplots(figsize=self.figsize)
        img_array = np.array(image)
        ax.imshow(img_array)
        if trace and len(trace) > 0:
            h, w = img_array.shape[:2]
            trace_array = np.array(trace)
            x_coords = trace_array[:, 0] * w / 256
            y_coords = trace_array[:, 1] * h / 256
            ax.plot(x_coords, y_coords, 'w-', linewidth=2, alpha=0.7)
            ax.plot(x_coords, y_coords, 'c-', linewidth=1, alpha=0.9)
            for i, (x, y) in enumerate(zip(x_coords, y_coords)):
                color_idx = int(i * 9 / max(len(x_coords) - 1, 1))
                ax.scatter(x, y, c=[self.colors[color_idx]], s=100,
                           edgecolors='white', linewidths=2, zorder=5)
                ax.annotate(f'{i+1}', (x, y), textcoords="offset points",
                            xytext=(5, 5), fontsize=10, color='white',
                            fontweight='bold')
            ax.scatter(x_coords[0], y_coords[0], c='lime', s=200,
                       marker='o', edgecolors='white', linewidths=3,
                       zorder=6, label='Start')
            ax.scatter(x_coords[-1], y_coords[-1], c='red', s=200,
                       marker='X', edgecolors='white', linewidths=3,
                       zorder=6, label='End')
        ax.set_title(title, fontsize=14, fontweight='bold')
        ax.axis('off')
        ax.legend(loc='upper right')
        plt.tight_layout()
        if save_path:
            plt.savefig(save_path, dpi=150, bbox_inches='tight')
            print(f"Saved visualization to {save_path}")
        plt.show()

    def plot_action(
        self,
        action: List[float],
        action_labels: Optional[List[str]] = None,
        title: str = "Predicted Robot Action",
        save_path: Optional[str] = None
    ) -> None:
        """Plot action values as a bar chart"""
        if action_labels is None:
            action_labels = [
                'Δx (forward)', 'Δy (left)', 'Δz (up)',
                'Rx (roll)', 'Ry (pitch)', 'Rz (yaw)',
                'Gripper'
            ]
        fig, ax = plt.subplots(figsize=(10, 5))
        colors = ['#3498db', '#3498db', '#3498db',
                  '#e74c3c', '#e74c3c', '#e74c3c',
                  '#2ecc71']
        x = np.arange(len(action))
        bars = ax.bar(x, action, color=colors, edgecolor='white', linewidth=1.5)
        for bar, val in zip(bars, action):
            height = bar.get_height()
            ax.annotate(f'{val:.3f}',
                        xy=(bar.get_x() + bar.get_width() / 2, height),
                        xytext=(0, 3 if height >= 0 else -12),
                        textcoords="offset points",
                        ha='center', va='bottom' if height >= 0 else 'top',
                        fontsize=9, fontweight='bold')
        ax.set_xticks(x)
        ax.set_xticklabels(action_labels, rotation=45, ha='right')
        ax.set_ylabel('Value', fontsize=12)
        ax.set_title(title, fontsize=14, fontweight='bold')
        ax.axhline(y=0, color='gray', linestyle='--', alpha=0.5)
        ax.grid(axis='y', alpha=0.3)
        from matplotlib.patches import Patch
        legend_elements = [
            Patch(facecolor='#3498db', label='Position'),
            Patch(facecolor='#e74c3c', label='Rotation'),
            Patch(facecolor='#2ecc71', label='Gripper')
        ]
        ax.legend(handles=legend_elements, loc='upper right')
        plt.tight_layout()
        if save_path:
            plt.savefig(save_path, dpi=150, bbox_inches='tight')
        plt.show()
We create visualization utilities that help us inspect the model's reasoning outputs intuitively. We overlay predicted traces onto images and build action plots to better understand the model's spatial and control decisions. We use these visual tools to make the output easier to interpret and analyze during experimentation.
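The only non-obvious arithmetic in the plotting code is rescaling the 256-normalized trace coordinates to pixel space. A pure-Python sketch of that mapping (the `scale_trace` helper is illustrative, not part of the visualizer):

```python
def scale_trace(trace, width, height, grid=256):
    # Map grid-normalized [x, y] points onto an image of size width x height,
    # just as plot_trace does with x * w / 256 and y * h / 256
    return [[x * width / grid, y * height / grid] for x, y in trace]

# e.g. a 640x480 camera frame
scaled = scale_trace([[128, 64], [256, 256]], width=640, height=480)
print(scaled)  # -> [[320.0, 120.0], [640.0, 480.0]]
```

Because x and y are scaled by width and height independently, the mapping stays correct for non-square frames; a point at the grid maximum (256, 256) lands exactly on the bottom-right corner of the image.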
    def plot_comparison(
        self,
        images: List[Image.Image],
        traces: List[List[List[int]]],
        titles: Optional[List[str]] = None,
        save_path: Optional[str] = None
    ) -> None:
        """Plot multiple images with their traces side by side"""
        n = len(images)
        fig, axes = plt.subplots(1, n, figsize=(5 * n, 5))
        if n == 1:
            axes = [axes]
        for idx, (ax, img, trace) in enumerate(zip(axes, images, traces)):
            img_array = np.array(img)
            ax.imshow(img_array)
            if trace and len(trace) > 0:
                h, w = img_array.shape[:2]
                trace_array = np.array(trace)
                x_coords = trace_array[:, 0] * w / 256
                y_coords = trace_array[:, 1] * h / 256
                ax.plot(x_coords, y_coords, 'c-', linewidth=2, alpha=0.9)
                ax.scatter(x_coords, y_coords, c='yellow', s=50,
                           edgecolors='white', linewidths=1, zorder=5)
            title = titles[idx] if titles else f"View {idx+1}"
            ax.set_title(title, fontsize=12, fontweight='bold')
            ax.axis('off')
        plt.tight_layout()
        if save_path:
            plt.savefig(save_path, dpi=150, bbox_inches='tight')
        plt.show()

print("\n" + "=" * 80)
print("SECTION 5: ACTION PROCESSING UTILITIES")
print("=" * 80)
class ActionProcessor:
    """Utilities for processing MolmoAct action outputs"""
    DEFAULT_STATS = {
        "molmoact": {
            "mean": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5],
            "std": [0.1, 0.1, 0.1, 0.5, 0.5, 0.5, 0.5],
        }
    }

    def __init__(self, stats: Optional[Dict] = None):
        self.stats = stats or self.DEFAULT_STATS

    def unnormalize(self, action: List[float], key: str = "molmoact") -> np.ndarray:
        """Unnormalize action values"""
        action = np.array(action)
        if key and key in self.stats:
            mean = np.array(self.stats[key]["mean"])
            std = np.array(self.stats[key]["std"])
            action = action * std + mean
        return action

    def normalize(self, action: np.ndarray, key: str = "molmoact") -> np.ndarray:
        """Normalize action values"""
        action = np.array(action)
        if key and key in self.stats:
            mean = np.array(self.stats[key]["mean"])
            std = np.array(self.stats[key]["std"])
            action = (action - mean) / std
        return action

    def process_gripper(self, action: np.ndarray, threshold: float = 0.5) -> Tuple[np.ndarray, bool]:
        """Process the gripper action value"""
        gripper_value = action[-1]
        gripper_open = gripper_value > threshold
        return action[:-1], gripper_open

    def smooth_actions(self, actions: List[np.ndarray], window_size: int = 3) -> List[np.ndarray]:
        """Smooth an action sequence with a moving average"""
        if len(actions) < window_size:
            return actions
        actions_array = np.array(actions)
        smoothed = np.zeros_like(actions_array)
        for i in range(len(actions)):
            start = max(0, i - window_size // 2)
            end = min(len(actions), i + window_size // 2 + 1)
            smoothed[i] = actions_array[start:end].mean(axis=0)
        return [smoothed[i] for i in range(len(smoothed))]

    @staticmethod
    def action_to_pose_delta(action: np.ndarray, scale: float = 1.0) -> Dict[str, np.ndarray]:
        """Convert an action to position and rotation deltas"""
        return {
            "position_delta": action[:3] * scale,
            "rotation_delta": action[3:6],
            "gripper": action[6] if len(action) > 6 else 1.0
        }

print("\n" + "=" * 80)
print("SECTION 6: EXAMPLE USAGE AND DEMO")
print("=" * 80)
We complete the visualization module and then introduce utilities for processing predicted actions. We define functions for normalization, unnormalization, gripper-state handling, smoothing, and conversion of actions into pose deltas. We use these utilities to transform raw model outputs into forms that are more useful for robotics analysis and downstream control.
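Unnormalization is just a per-dimension affine map, `action * std + mean`. A dependency-free sketch using the tutorial's default "molmoact" statistics shows the round trip (the free-standing `normalize`/`unnormalize` functions here mirror the processor's methods):

```python
# Default "molmoact" statistics from the processor above
MEAN = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5]
STD  = [0.1, 0.1, 0.1, 0.5, 0.5, 0.5, 0.5]

def unnormalize(action):
    # Per-dimension affine map back to physical units
    return [a * s + m for a, s, m in zip(action, STD, MEAN)]

def normalize(action):
    # Inverse map: back to the model's normalized range
    return [(a - m) / s for a, s, m in zip(action, STD, MEAN)]

act = [0.5, -0.3, 0.2, 0.1, -0.1, 0.05, 0.8]
unnorm = unnormalize(act)
print(unnorm)
```

The round trip `normalize(unnormalize(act))` recovers the original values up to floating-point error, which is exactly the property that lets the same predicted action be re-targeted to robots with different statistics.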
def load_example_images() -> Tuple[Image.Image, Image.Image]:
    """Load the example images from Hugging Face"""
    print("Loading example images...")
    url1 = "https://huggingface.co/allenai/MolmoAct-7B-D-0812/resolve/main/example_1.png"
    url2 = "https://huggingface.co/allenai/MolmoAct-7B-D-0812/resolve/main/example_2.png"
    headers = {"User-Agent": "python-requests"}
    r1 = requests.get(url1, headers=headers, timeout=30)
    r1.raise_for_status()
    r2 = requests.get(url2, headers=headers, timeout=30)
    r2.raise_for_status()
    img1 = Image.open(BytesIO(r1.content)).convert("RGB")
    img2 = Image.open(BytesIO(r2.content)).convert("RGB")
    print(f"Loaded images: {img1.size} and {img2.size}")
    return img1, img2

def display_images(img1: Image.Image, img2: Image.Image) -> None:
    """Display the example images"""
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))
    axes[0].imshow(img1)
    axes[0].set_title("Side View (Exocentric)", fontsize=12, fontweight='bold')
    axes[0].axis('off')
    axes[1].imshow(img2)
    axes[1].set_title("Wrist View (Egocentric)", fontsize=12, fontweight='bold')
    axes[1].axis('off')
    plt.tight_layout()
    plt.show()
def run_demo():
    """Run the complete MolmoAct demo"""
    print("\n" + "=" * 80)
    print("RUNNING MOLMOACT DEMO")
    print("=" * 80)
    img1, img2 = load_example_images()
    display_images(img1, img2)
    print("\nInitializing MolmoAct...")
    config = MolmoActConfig(
        model_name="allenai/MolmoAct-7B-D-0812",
        torch_dtype="bfloat16",
        max_new_tokens=256,
    )
    model = MolmoActModel(config)
    model.load()
    instruction = "close the box"
    print(f"\nTask Instruction: '{instruction}'")
    print("Generating action reasoning...")
    start_time = time.time()
    result = model.generate([img1, img2], instruction)
    inference_time = time.time() - start_time
    print(f"Inference time: {inference_time:.2f}s")
    print("\n" + "-" * 60)
    print("GENERATED REASONING:")
    print("-" * 60)
    print(result['text'][:500] + "..." if len(result['text']) > 500 else result['text'])
    print("\n" + "-" * 60)
    print("PARSED OUTPUTS:")
    print("-" * 60)
    print(f"\nDepth Tokens: {result['depth'][0][:50]}..." if result['depth'] else "No depth tokens")
    print(f"\nVisual Trace: {result['trace']}")
    print(f"\nAction (unnormalized): {result['action']}")
    print(f"Action (raw): {result['action_raw']}")
    print("\n" + "-" * 60)
    print("VISUALIZATIONS:")
    print("-" * 60)
    visualizer = MolmoActVisualizer()
    if result['trace'] and len(result['trace']) > 0:
        visualizer.plot_trace(
            img1,
            result['trace'][0],
            title=f"Visual Trace for: '{instruction}'"
        )
    if result['action'] and len(result['action']) > 0:
        visualizer.plot_action(
            result['action'][0],
            title=f"Predicted Action for: '{instruction}'"
        )
    print("\n" + "-" * 60)
    print("ACTION PROCESSING:")
    print("-" * 60)
    if result['action'] and len(result['action']) > 0:
        processor = ActionProcessor()
        action = np.array(result['action'][0])
        pose_delta = processor.action_to_pose_delta(action)
        print(f"\nPosition Delta: {pose_delta['position_delta']}")
        print(f"Rotation Delta: {pose_delta['rotation_delta']}")
        print(f"Gripper State: {'OPEN' if pose_delta['gripper'] > 0.5 else 'CLOSED'}")
    print("\n" + "=" * 80)
    print("DEMO COMPLETED!")
    print("=" * 80)
    return model, result
print("\n" + "=" * 80)
print("SECTION 7: ADVANCED FEATURES")
print("=" * 80)
class MolmoActRollout:
    """Rollout controller for continuous action generation"""
    def __init__(
        self,
        model: MolmoActModel,
        action_chunk_size: int = 8,
        smoothing_window: int = 3
    ):
        self.model = model
        self.action_chunk_size = action_chunk_size
        self.smoothing_window = smoothing_window
        self.processor = ActionProcessor()
        self.action_history = []
        self.reset()

    def reset(self):
        """Reset the rollout state"""
        self.action_history = []
        self.step_count = 0

    def step(self, images: List[Image.Image], instruction: str) -> Dict:
        """Execute one step of the rollout"""
        result = self.model.generate(images, instruction)
        if result['action'] and len(result['action']) > 0:
            action = np.array(result['action'][0])
            self.action_history.append(action)
            self.step_count += 1
            if len(self.action_history) >= self.smoothing_window:
                smoothed = self.processor.smooth_actions(
                    self.action_history[-self.smoothing_window:],
                    self.smoothing_window
                )[-1]
            else:
                smoothed = action
            result['smoothed_action'] = smoothed
        result['step'] = self.step_count
        return result

    def get_action_statistics(self) -> Dict:
        """Get statistics of the collected actions"""
        if not self.action_history:
            return {}
        actions = np.array(self.action_history)
        return {
            "mean": actions.mean(axis=0).tolist(),
            "std": actions.std(axis=0).tolist(),
            "min": actions.min(axis=0).tolist(),
            "max": actions.max(axis=0).tolist(),
            "num_steps": len(self.action_history)
        }
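The rollout's moving-average smoothing can be checked standalone. This sketch applies the same centered, boundary-clamped window to a scalar action stream (the `smooth` helper is illustrative; the real code operates on 7-dimensional action vectors):

```python
def smooth(values, window=3):
    # Centered moving average with the window clamped at the sequence edges,
    # mirroring ActionProcessor.smooth_actions for a 1-D stream
    if len(values) < window:
        return list(values)
    out = []
    for i in range(len(values)):
        start = max(0, i - window // 2)
        end = min(len(values), i + window // 2 + 1)
        chunk = values[start:end]
        out.append(sum(chunk) / len(chunk))
    return out

print(smooth([0.0, 1.0, 0.0, 1.0, 0.0]))
```

An alternating 0/1 stream is pulled toward 0.5, which is exactly the jitter-damping effect the rollout relies on; sequences shorter than the window pass through unchanged.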
def demonstrate_custom_stats():
    """Demonstrate using custom normalization statistics"""
    print("\n" + "-" * 60)
    print("CUSTOM STATISTICS DEMO")
    print("-" * 60)
    custom_stats = {
        "franka": {
            "mean": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5],
            "std": [0.05, 0.05, 0.05, 0.3, 0.3, 0.3, 0.5],
        },
        "ur5": {
            "mean": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5],
            "std": [0.08, 0.08, 0.08, 0.4, 0.4, 0.4, 0.5],
        }
    }
    processor = ActionProcessor(custom_stats)
    normalized_action = np.array([0.5, -0.3, 0.2, 0.1, -0.1, 0.05, 0.8])
    print("Normalized action:", normalized_action)
    print("\nUnnormalized for different robots:")
    for robot in ["franka", "ur5"]:
        unnorm = processor.unnormalize(normalized_action, key=robot)
        print(f"  {robot}: {unnorm}")
print("\n" + "=" * 80)
print("SECTION 8: TIPS AND BEST PRACTICES")
print("=" * 80)

tips = """
╔══════════════════════════════════════════════════════════════════════════════╗
║                           MolmoAct Best Practices                            ║
╠══════════════════════════════════════════════════════════════════════════════╣
║                                                                              ║
║  IMAGE INPUTS:                                                               ║
║   • Use 2 camera views: side (exocentric) + wrist (egocentric)               ║
║   • Ensure good lighting and clear visibility                                ║
║   • Match the camera setup to the training distribution (Franka/DROID-like)  ║
║                                                                              ║
║  INSTRUCTIONS:                                                               ║
║   • Keep instructions clear and concise                                      ║
║   • Use action-oriented language ("pick up", "push", "close")                ║
║   • Avoid ambiguous references                                               ║
║                                                                              ║
║  PERFORMANCE:                                                                ║
║   • Use bfloat16 for faster inference                                        ║
║   • Batch similar observations when possible                                 ║
║   • Consider vLLM for production deployment                                  ║
║                                                                              ║
║  FINE-TUNING:                                                                ║
║   • Collect 50-100 demonstrations for new tasks                              ║
║   • Use LoRA for efficient adaptation                                        ║
║   • Include depth perception in the training data                            ║
║                                                                              ║
║  SAFETY:                                                                     ║
║   • Always inspect visual traces before execution                            ║
║   • Implement force limits and collision detection                           ║
║   • Test in simulation before real-world deployment                          ║
║                                                                              ║
╚══════════════════════════════════════════════════════════════════════════════╝
"""
print(tips)
if __name__ == "__main__":
    print("\n" + "=" * 80)
    print("MOLMOACT ADVANCED TUTORIAL - MAIN EXECUTION")
    print("=" * 80)
    print("""
This tutorial provides a comprehensive guide to MolmoAct.

To run the full demo (requires a GPU with ~16GB of VRAM):
    model, result = run_demo()

To just load the example images and explore:
    img1, img2 = load_example_images()
    display_images(img1, img2)

For advanced features:
    demonstrate_custom_stats()

Happy robotics!
""")
    try:
        model, result = run_demo()
    except Exception as e:
        print(f"\nCould not run the full demo: {e}")
        print("Falling back to loading the example images only...")
        try:
            img1, img2 = load_example_images()
            display_images(img1, img2)
            print("\nImages loaded! Call run_demo() once a suitable GPU is available.")
        except Exception as e2:
            print(f"\nCould not load images: {e2}")
            print("This is expected in environments without internet access.")
We bring everything together through example image loading, demo execution, rollout logic, and best-practice guidance. We run the end-to-end workflow, visualize outputs, process predicted actions, and extend the setup to support continuous rollout and custom statistics. We conclude by presenting the main execution block, which lets us explore MolmoAct as a complete practical pipeline for spatial reasoning and robot action generation.
In conclusion, we gained a comprehensive, hands-on view of how MolmoAct can be used for spatial reasoning and action generation in a structured, interpretable way. We went beyond basic inference by visualizing traces, processing action outputs, experimenting with rollout-style control, and understanding how the model can fit into broader simulation and robotics workflows. Through this end-to-end implementation, we saw how MolmoAct brings together vision, reasoning, and action prediction into a single practical pipeline that we can study, adapt, and extend for more advanced embodied AI applications.
The post A Coding Implementation of MolmoAct for Depth-Aware Spatial Reasoning, Visual Trajectory Tracing, and Robotic Action Prediction appeared first on MarkTechPost.
