
Physical Intelligence Team Unveils MEM for Robots: A Multi-Scale Memory System Giving Gemma 3-4B VLAs 15-Minute Context for Complex Tasks

Current end-to-end robotic policies, specifically Vision-Language-Action (VLA) models, typically operate on a single observation or a very short history. This ‘lack of memory’ makes long-horizon tasks, such as cleaning a kitchen or following a complex recipe, computationally intractable or prone to failure. To address this, researchers from Physical Intelligence, Stanford, UC Berkeley, and MIT have introduced Multi-Scale Embodied Memory (MEM).

https://www.pi.website/download/Mem.pdf

The Dual-Scale Memory Architecture

MEM factorizes robotic memory into two distinct scales to balance semantic context with real-time control constraints.

(1) Short-Term Video Memory

For tasks requiring fine-grained spatial awareness—like resolving self-occlusions or adapting a grasp—dense visual data is required. MEM utilizes an efficient video encoder that extends standard Vision Transformers (ViTs). To maintain real-time inference (the 380ms ‘real-time barrier’), the architecture avoids joint attention over all patches. Instead, it uses Space-Time Separable Attention, interleaving spatial attention within frames with causal-temporal attention across frames every fourth layer.
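The interleaving pattern described above can be sketched in a few lines of numpy. This is a hedged illustration, not the paper's implementation: the attention is single-head and unparameterized, and the layer count and shapes are made-up values. It shows the structure only — spatial attention within each frame on most layers, causal temporal attention across frames on every fourth layer.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_attention(x):
    # x: (K, n, d) -- the n patches within each frame attend to each other.
    scores = x @ x.swapaxes(-1, -2) / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def causal_temporal_attention(x):
    # x: (K, n, d) -- each patch position attends to the same position
    # in the current and past frames only (causal mask over time).
    K, n, d = x.shape
    xt = x.transpose(1, 0, 2)                     # (n, K, d)
    scores = xt @ xt.swapaxes(-1, -2) / np.sqrt(d)
    mask = np.triu(np.ones((K, K)), k=1).astype(bool)
    scores[:, mask] = -1e9                        # block attention to the future
    return (softmax(scores) @ xt).transpose(1, 0, 2)

def separable_encoder(x, num_layers=8):
    # Temporal attention on every fourth layer, spatial otherwise
    # (layer count is illustrative, not from the paper).
    for layer in range(num_layers):
        if (layer + 1) % 4 == 0:
            x = x + causal_temporal_attention(x)
        else:
            x = x + spatial_attention(x)
    return x

x = np.random.randn(16, 49, 32)   # K=16 frames, n=49 patches, d=32
y = separable_encoder(x)
print(y.shape)                    # (16, 49, 32): token count per frame is unchanged
```

Because the temporal pass only mixes tokens at the same spatial position, each frame keeps its own n-token representation, which is what lets the model hand only the current frame's tokens to the backbone.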

The computational complexity is reduced from $O(n^2K^2)$ to $O(Kn^2 + nK^2)$, where $n$ is the number of spatial patches and $K$ is the number of timesteps. By dropping tokens from past timesteps in upper layers, the model passes only the current observation’s representation to the VLA backbone, keeping the token count invariant compared to single-frame models.
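The savings are easy to check with concrete numbers. The snippet below counts pairwise attention interactions per layer for joint versus separable attention; the values of n and K are illustrative (a 16×16 patch grid and 16 frames), not taken from the paper.

```python
# Pairwise attention interactions per layer.
n, K = 256, 16                      # illustrative: 16x16 patches, 16 frames

joint = (n * K) ** 2                # all n*K tokens attend to all n*K tokens
separable = K * n**2 + n * K**2     # per-frame spatial + per-position temporal

print(joint, separable, round(joint / separable, 1))
```

With these numbers the separable scheme does roughly 15x fewer interactions per layer, and the gap widens as K grows.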

(2) Long-Term Language Memory

To handle tasks spanning up to 15 minutes, MEM uses a language-based representation for semantic events. The system decomposes the action prediction as:

$$\pi(a_{t:t+H}, l_{t+1}, m_{t+1} \mid o_{t-T:t}, m_t, g) \approx \pi_{LL}(a_{t:t+H} \mid o_{t-K:t}, l_{t+1}, g)\,\pi_{HL}(l_{t+1}, m_{t+1} \mid o_t, m_t, g)$$

Here, a high-level policy ($\pi_{HL}$) maintains a running language summary ($m_t$) of past events and generates subtask instructions ($l_{t+1}$) for a low-level policy ($\pi_{LL}$). This language memory is trained using LLM-generated summaries that compress information (e.g., ‘I placed three bowls’ instead of individual attributes), reducing the risk of training-inference distribution shifts.
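The factorization above implies a simple control loop: the high-level policy updates the language memory and emits a subtask, then the low-level policy maps the recent observation window plus that subtask to an action chunk. The sketch below shows only this loop structure; both policy functions are hypothetical stand-ins (simple string manipulations), not the paper's models, and the window size and horizon are assumptions.

```python
from collections import deque

def high_level_policy(obs, memory, goal):
    # Stand-in for pi_HL: append a one-line event summary to the
    # language memory m_t and emit the next subtask instruction l_{t+1}.
    memory = memory + [f"observed {obs}"]
    subtask = f"subtask {len(memory)} toward '{goal}'"
    return subtask, memory

def low_level_policy(obs_window, subtask, goal, horizon=4):
    # Stand-in for pi_LL: emit an action chunk a_{t:t+H} from the
    # short observation window and the current subtask.
    return [f"action {i} for {subtask}" for i in range(horizon)]

def control_loop(observations, goal, window=16):
    obs_window = deque(maxlen=window)   # short-term visual context (K frames)
    memory = []                         # long-term language memory m_t
    actions = []
    for obs in observations:
        obs_window.append(obs)
        subtask, memory = high_level_policy(obs, memory, goal)
        actions.extend(low_level_policy(list(obs_window), subtask, goal))
    return actions, memory

acts, mem = control_loop(["frame0", "frame1", "frame2"], goal="clean kitchen")
print(len(acts), len(mem))  # 12 3
```

The key design point the loop reflects is that the dense visual window is bounded (the deque's `maxlen`), while the language memory grows by one compressed summary per event, so long-horizon context stays cheap.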


Implementation and Performance

The research team integrated MEM into the π0.6 VLA, which is initialized from a pre-trained Gemma 3-4B model. The model was pre-trained on a diverse mixture of robot demonstrations, vision-language tasks, and internet video data.

Key Results:

  • In-Context Adaptation: MEM enables robots to adapt manipulation strategies based on recent failures. In evaluation, this led to a +62% success rate increase in opening refrigerators with unknown hinge directions and a +11% increase in picking up chopsticks at variable heights.
  • Long-Horizon Tasks: The model successfully performed 15-minute tasks like ‘Recipe Setup’ (retrieving ingredients from multiple locations) and ‘Kitchen Cleaning’ (washing dishes and wiping counters). Memory-less VLAs failed these tasks significantly more often.
  • Efficiency: The video encoder allows the model to process up to 16 observation frames (spanning ~1 minute) while remaining under critical real-time inference thresholds on a single NVIDIA H100 GPU.

MEM demonstrates that combining dense, short-term visual tokens with compressed, long-term language summaries allows VLAs to scale their ‘working memory’ without incurring prohibitive computational costs.


Check out the Paper and technical details.

The post Physical Intelligence Team Unveils MEM for Robots: A Multi-Scale Memory System Giving Gemma 3-4B VLAs 15-Minute Context for Complex Tasks appeared first on MarkTechPost.
