
Google DeepMind Researchers Introduce Evo-Memory Benchmark and ReMem Framework for Experience Reuse in LLM Agents

Large language model agents are beginning to store everything they see, but can they actually improve their policies at test time from these experiences rather than just replaying context windows?

Researchers from the University of Illinois Urbana-Champaign and Google DeepMind propose Evo-Memory, a streaming benchmark and agent framework that targets this exact gap. Evo-Memory evaluates test-time learning with self-evolving memory, asking whether agents can accumulate and reuse strategies from continuous task streams instead of relying solely on static conversational logs.

https://arxiv.org/pdf/2511.20857

Conversational Recall vs Experience Reuse

Most current agents implement conversational recall. They store dialogue history, tool traces, and retrieved documents, which are then reinserted into the context window for future queries. This kind of memory serves as a passive buffer, able to recover facts or recall earlier steps, but it does not actively modify the agent's approach to related tasks.

Evo-Memory instead focuses on experience reuse. Here every interaction is treated as an experience that encodes not only inputs and outputs, but also whether a task succeeded and which strategies were effective. The benchmark tests whether agents can retrieve these experiences in later tasks, apply them as reusable procedures, and refine the memory over time.

Benchmark Design and Task Streams

The research team formalizes a memory-augmented agent as a tuple (F, U, R, C). The base model F generates outputs. The retrieval module R searches a memory store. The context constructor C synthesizes a working prompt from the current input and retrieved items. The update function U writes new experience entries and evolves the memory after every step.
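To make those roles concrete, here is a minimal Python sketch of one step of that loop. The class and function names are our own illustration of the (F, U, R, C) abstraction, not code from the paper.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List

@dataclass
class MemoryAgent:
    F: Callable[[str], str]         # base model: prompt -> output
    R: Callable[[str, list], list]  # retrieval: (input, memory) -> relevant entries
    C: Callable[[str, list], str]   # context constructor: (input, entries) -> prompt
    U: Callable[[list, Any], list]  # update: (memory, new experience) -> memory
    memory: List[Any] = field(default_factory=list)

    def step(self, x: str, feedback_fn: Callable[[str, str], Any]) -> str:
        entries = self.R(x, self.memory)              # search the memory store
        prompt = self.C(x, entries)                   # synthesize the working prompt
        y = self.F(prompt)                            # generate an output
        f = feedback_fn(x, y)                         # e.g. a correctness signal
        self.memory = self.U(self.memory, (x, y, f))  # evolve memory after the step
        return y
```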

Evo-Memory restructures standard benchmarks into sequential task streams. Each dataset becomes an ordered sequence of tasks where early items carry strategies that are useful for later ones. The suite covers AIME 24, AIME 25, GPQA Diamond, the MMLU-Pro economics, engineering, and philosophy subsets, and ToolBench for tool use, along with multi-turn environments from AgentBoard, including AlfWorld, BabyAI, ScienceWorld, Jericho, and PDDL planning.

Evaluation is done along four axes. Single-turn tasks use exact match or answer accuracy. Embodied environments report success rate and progress rate. Step efficiency measures average steps per successful task. Sequence robustness checks whether performance is stable when the task order changes.
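As a rough illustration, the first three axes could be scored from per-task records like the sketch below; the record fields and function name are assumptions for illustration, not the benchmark's actual evaluation code.

```python
def stream_metrics(records):
    """records: list of dicts with 'success' (bool), 'progress' (0..1), 'steps' (int)."""
    done = [r for r in records if r["success"]]
    return {
        "success_rate": len(done) / len(records),
        "progress_rate": sum(r["progress"] for r in records) / len(records),
        # step efficiency: average steps per successful task
        "steps_per_success": sum(r["steps"] for r in done) / len(done) if done else None,
    }
```

Sequence robustness would then compare these numbers across shuffled orderings of the same task stream.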

https://arxiv.org/pdf/2511.20857

ExpRAG, a Minimal Experience Reuse Baseline

To set a lower bound, the research team defines ExpRAG. Each interaction becomes a structured experience text with template ⟨xᵢ, ŷᵢ, fᵢ⟩, where xᵢ is the input, ŷᵢ is the model output, and fᵢ is feedback, for example a correctness signal. At a new step t, the agent retrieves similar experiences from memory using a similarity score and concatenates them with the current input as in-context examples. Then it appends the new experience to memory.

ExpRAG does not change the agent control loop. It is still a single-shot call to the backbone, but now augmented with explicitly stored prior tasks. The design is deliberately simple so that any gains on Evo-Memory can be attributed to task-level experience retrieval, not to new planning or tool abstractions.
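A compact sketch of that baseline, under the ⟨xᵢ, ŷᵢ, fᵢ⟩ template above, could look as follows; `llm` and `embed` are assumed stand-ins for the backbone call and the retriever encoder.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

class ExpRAG:
    def __init__(self, llm, embed, k=3):
        self.llm, self.embed, self.k = llm, embed, k
        self.memory = []  # list of (embedding, experience_text) pairs

    def answer(self, x, feedback_fn):
        q = self.embed(x)
        # retrieve the k most similar past experiences as in-context examples
        top = sorted(self.memory, key=lambda m: -cosine(q, m[0]))[: self.k]
        demos = "\n\n".join(text for _, text in top)
        y = self.llm(f"{demos}\n\nTask: {x}")  # still a single-shot backbone call
        f = feedback_fn(x, y)                  # e.g. correct / incorrect
        self.memory.append((q, f"Input: {x}\nOutput: {y}\nFeedback: {f}"))
        return y
```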

ReMem, Action Think Memory Refine

The main contribution on the agent side is ReMem, an action–think–memory-refine pipeline built on top of the same backbone models. At each internal step, given the current input, the memory state, and past reasoning traces, the agent chooses one of three operations:

  • Think generates intermediate reasoning traces that decompose the task.
  • Act emits an environment action or final answer visible to the user.
  • Refine performs meta-reasoning on memory by retrieving, pruning, and reorganizing experience entries.

This loop induces a Markov decision process whose state consists of the query, the current memory, and the ongoing thoughts. Within a step the agent can interleave several Think and Refine operations, and the step terminates when an Act operation is issued. In contrast to standard ReAct-style agents, memory is no longer a fixed buffer. It becomes an explicit object that the agent reasons about and edits during inference.
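Schematically, one ReMem step could be written as the loop below. The operation policy and the Think, Act, and Refine handlers are hypothetical stand-ins for prompts to the backbone model.

```python
def remem_step(x, memory, choose_op, think, refine_memory, act, max_inner_ops=8):
    """State is (query, memory, thoughts); the step ends when Act is chosen."""
    thoughts = []
    for _ in range(max_inner_ops):
        op = choose_op(x, memory, thoughts)
        if op == "think":
            thoughts.append(think(x, memory, thoughts))  # decompose the task
        elif op == "refine":
            memory = refine_memory(memory, x, thoughts)  # retrieve, prune, reorganize
        else:                                            # "act" terminates the step
            return act(x, memory, thoughts), memory
    return act(x, memory, thoughts), memory              # safety cap: force an action
```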

https://arxiv.org/pdf/2511.20857

Results on Reasoning, Tools and Embodied Environments

The analysis crew instantiate all strategies on Gemini 2.5 Flash and Claude 3.7 Sonnet beneath a unified search–predict–evolve protocol. This isolates the impact of reminiscence structure, since prompting, search and suggestions are held fixed throughout baselines.

On single-turn benchmarks, evolving memory methods produce consistent but moderate gains. For Gemini 2.5 Flash, ReMem reaches an average exact match of 0.65 across AIME 24, AIME 25, GPQA Diamond, and the MMLU-Pro subsets, and scores 0.85 and 0.71 on the ToolBench API and accuracy metrics. ExpRAG also performs strongly, with an average of 0.60, and outperforms several more complex designs such as Agent Workflow Memory and Dynamic Cheatsheet variants.

The impact is larger in multi-turn environments. On Claude 3.7 Sonnet, ReMem reaches success and progress rates of 0.92 and 0.96 on AlfWorld, 0.73 and 0.83 on BabyAI, 0.83 and 0.95 on PDDL, and 0.62 and 0.89 on ScienceWorld, for an average of 0.78 success and 0.91 progress across datasets. On Gemini 2.5 Flash, ReMem achieves an average of 0.50 success and 0.64 progress, improving over history and ReAct-style baselines in all four environments.

Step efficiency also improves. In AlfWorld, the average number of steps to complete a task drops from 22.6 for a history baseline to 11.5 for ReMem. Lightweight designs such as ExpRecent and ExpRAG reduce steps as well, which indicates that even simple task-level experience reuse can make behavior more efficient without architectural changes to the backbone.

A further analysis links the gains to task similarity within each dataset. Using embeddings from the retriever encoder, the research team computes the average distance from tasks to their cluster center. ReMem's margin over a history baseline correlates strongly with this similarity measure, with reported Pearson correlations of about 0.72 on Gemini 2.5 Flash and 0.56 on Claude 3.7 Sonnet. Structured domains such as PDDL and AlfWorld show larger improvements than more diverse sets like AIME 25 or GPQA Diamond.
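In code, this analysis could amount to something like the sketch below, where `embed` again stands in for the retriever encoder and the per-dataset margins come from the benchmark runs.

```python
import numpy as np
from scipy.stats import pearsonr

def dataset_similarity(task_texts, embed):
    """Higher value = tasks sit closer to their cluster center, i.e. more alike."""
    E = np.stack([embed(t) for t in task_texts])
    center = E.mean(axis=0)
    return -float(np.linalg.norm(E - center, axis=1).mean())

# margins[i]: ReMem score minus history-baseline score on dataset i
# sims[i]:    dataset_similarity(...) for dataset i
def similarity_correlation(margins, sims):
    r, _ = pearsonr(sims, margins)
    return r  # the paper reports roughly 0.72 (Gemini 2.5 Flash) and 0.56 (Claude 3.7 Sonnet)
```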

Key Takeaways

  1. Evo-Memory is a comprehensive streaming benchmark that converts standard datasets into ordered task streams, so agents can retrieve, integrate, and update memory over time rather than rely on static conversational recall.
  2. The framework formalizes memory-augmented agents as a tuple (F, U, R, C) and implements more than 10 representative memory modules, including retrieval-based, workflow, and hierarchical memories, evaluated on 10 single-turn and multi-turn datasets across reasoning, question answering, tool use, and embodied environments.
  3. ExpRAG provides a minimal experience-reuse baseline that stores each task interaction as a structured text record with input, model output, and feedback, then retrieves similar experiences as in-context exemplars for new tasks, already giving consistent improvements over pure history-based baselines.
  4. ReMem extends the standard ReAct-style loop with an explicit Think, Act, Refine-Memory control cycle, which lets the agent actively retrieve, prune, and reorganize its memory during inference, leading to higher accuracy, higher success rates, and fewer steps on both single-turn reasoning and long-horizon interactive environments.
  5. Across Gemini 2.5 Flash and Claude 3.7 Sonnet backbones, self-evolving memories such as ExpRAG and especially ReMem make smaller models behave like stronger agents at test time, improving exact match, success, and progress metrics without any retraining of base model weights.

Editorial Notes

Evo-Memory is a useful step toward evaluating self-evolving memory in LLM agents. It forces models to operate on sequential task streams instead of isolated prompts. It compares more than 10 memory architectures under a single framework. Simple methods like ExpRAG already show clear gains. ReMem's action, think, refine-memory loop improves exact match, success, and progress without retraining base weights. Overall, this work makes test-time evolution a concrete design target for LLM agent systems.


Check out the Paper. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don't forget to join our 100k+ ML SubReddit and subscribe to our Newsletter. Wait! Are you on Telegram? Now you can join us on Telegram as well.

The post Google DeepMind Researchers Introduce Evo-Memory Benchmark and ReMem Framework for Experience Reuse in LLM Agents appeared first on MarkTechPost.
