
MEMO: A Modular Framework for Training a Dedicated Memory Model on New Knowledge Without Modifying LLM Parameters

Large language models become static after pretraining. Their knowledge does not update as the world changes. Retraining a full LLM is too expensive at modern scales. Fine-tuning risks degrading previously learned knowledge. Retrieval-augmented generation (RAG) struggles when answers require reasoning across many documents.

A team of researchers from the National University of Singapore, MIT CSAIL, A*STAR, and the Singapore-MIT Alliance for Research and Technology (SMART) proposes a new approach called MEMO (Memory as a Model).

What Problem Does MEMO Solve?

Existing methods for integrating new knowledge into LLMs fall into three categories. Non-parametric methods like RAG retrieve documents at inference time. They are sensitive to retrieval noise and struggle with cross-document reasoning. Parametric methods such as continued pretraining or supervised fine-tuning internalize knowledge into model weights. They are computationally expensive and cause catastrophic forgetting, where new training degrades previously acquired knowledge. Latent memory methods compress knowledge into soft tokens. These representations are tightly bound to the model that produced them, a limitation the research team calls representation coupling, which limits transferability across LLMs.

MEMORY as a Separate Model

MEMO separates memory from reasoning. The MEMORY model is a small, dedicated language model trained to internalize knowledge from a target corpus. The EXECUTIVE model is the main LLM, frozen and queried only through its standard input-output interface.

In experiments, the MEMORY model is Qwen2.5-14B-Instruct. The EXECUTIVE model is either Qwen2.5-32B-Instruct or Gemini-3-Flash, a proprietary closed-source model. Because MEMO treats the EXECUTIVE model as a black box, it does not require weight access or output logits.

https://arxiv.org/pdf/2605.15156

How the MEMORY Model is Trained

Training begins with a five-step data synthesis pipeline guided by a GENERATOR model (Qwen2.5-32B-Instruct in experiments). The pipeline converts a raw document corpus into a reflection QA dataset: question-answer pairs that represent corpus knowledge under diverse query variations.

The five steps are:

  1. Fact extraction: direct extraction of explicitly stated facts and indirect extraction of inferred information, run in parallel per document chunk.
  2. Consolidation: QA pairs sharing a common context (entity, time period, relationship) are merged into multi-fact pairs.
  3. Verification and rewriting: each QA pair is checked for self-containment. Pairs with unresolved pronouns or implicit references are rewritten using the source chunk or discarded.
  4. Entity surfacing: QA pairs are generated where questions encode entity attributes and relationships, and answers reveal entity identities. This targets the reversal curse, where models trained on “A is B” fail to infer “B is A.”
  5. Cross-document synthesis: the GENERATOR model constructs QA pairs spanning multiple documents. It identifies two types of cross-document connections: converging clues (multiple documents about the same entity) and parallel properties (different entities sharing a common attribute or role).

Step 5 is the most critical component. A leave-one-out ablation shows that removing it drops accuracy from 24.00% to 6.37% on NarrativeQA. It is also the dominant source of training pairs in the final dataset.
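The five steps above can be read as a chain of GENERATOR calls. The following is a minimal sketch of that control flow, not the paper's implementation: the `QAPair` type, the task names, the `(task, text)` generator signature, and `stub_generator` are all hypothetical, and the stub stands in for a real LLM call so the code runs without a model.

```python
from dataclasses import dataclass
from typing import Callable

Pair = tuple[str, str]  # (question, answer)
# Hypothetical GENERATOR interface: (task name, input text) -> QA pairs.
Generator = Callable[[str, str], list[Pair]]


@dataclass(frozen=True)
class QAPair:
    question: str
    answer: str
    step: str  # pipeline step that produced the pair


def render(pairs: list[Pair]) -> str:
    """Serialize QA pairs as text so they can be passed back to the generator."""
    return "\n".join(f"Q: {q}\nA: {a}" for q, a in pairs)


def build_reflection_dataset(chunks: list[str], generator: Generator) -> list[QAPair]:
    dataset: list[QAPair] = []
    for chunk in chunks:
        # Step 1: direct and indirect extraction on the same chunk
        # (sequential here; the article describes them as running in parallel).
        facts = generator("extract_direct", chunk) + generator("extract_indirect", chunk)
        # Step 2: merge pairs that share an entity, time period, or relationship.
        merged = generator("consolidate", render(facts))
        # Step 3: rewrite non-self-contained pairs against the chunk, or drop them.
        verified = generator("verify_and_rewrite", chunk + "\n" + render(facts + merged))
        # Step 4: attribute-to-identity questions aimed at the reversal curse.
        surfaced = generator("surface_entities", render(verified))
        dataset += [QAPair(q, a, "verified") for q, a in verified]
        dataset += [QAPair(q, a, "entity") for q, a in surfaced]
    # Step 5: pairs that need more than one document to answer.
    cross = generator("cross_document", "\n\n".join(chunks))
    dataset += [QAPair(q, a, "cross_document") for q, a in cross]
    return dataset


def stub_generator(task: str, text: str) -> list[Pair]:
    """Stand-in for an LLM call: returns one placeholder pair per request."""
    return [(f"[{task}] question", "placeholder answer")]


dataset = build_reflection_dataset(["chunk one", "chunk two"], stub_generator)
```

In a real pipeline the stub would be replaced by prompts to the GENERATOR model, and step 5 would likely need to group related documents rather than concatenate the whole corpus.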

The MEMORY model is then trained via supervised fine-tuning (SFT). The loss is computed over answer tokens only. Source documents are never provided at inference. The model must answer from internalized parametric knowledge.
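Computing the loss over answer tokens only is usually done by masking the question positions out of the labels. Here is a small sketch of that idea in plain Python. The helper names are hypothetical, the value -100 is the default `ignore_index` of PyTorch's cross-entropy loss, and the next-token shift that a real causal LM applies is omitted for brevity.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def build_labels(question_ids: list[int], answer_ids: list[int]) -> tuple[list[int], list[int]]:
    """Concatenate question and answer tokens; mask question positions in the labels."""
    input_ids = list(question_ids) + list(answer_ids)
    labels = [IGNORE_INDEX] * len(question_ids) + list(answer_ids)
    return input_ids, labels


def answer_only_nll(token_log_probs: list[float], labels: list[int]) -> float:
    """Mean negative log-likelihood over the unmasked (answer) positions only.

    token_log_probs[i] is the log-probability the model assigned to the
    correct token at position i.
    """
    kept = [-lp for lp, y in zip(token_log_probs, labels) if y != IGNORE_INDEX]
    return sum(kept) / len(kept)


input_ids, labels = build_labels([5, 6, 7], [8, 9])
loss = answer_only_nll([-9.0, -9.0, -9.0, -1.0, -3.0], labels)
```

The three question positions carry large losses in this toy input, but they do not affect the result because their labels are masked.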

Inference: The Structured Multi-Turn Protocol

At inference, the EXECUTIVE model queries the MEMORY model through a structured multi-turn protocol with three sequential stages.

Stage 1: Grounding. The EXECUTIVE model decomposes the query into atomic sub-questions. Each targets a single identifying constraint. The MEMORY model answers each independently.

Stage 2: Entity identification. Using the grounding responses, the EXECUTIVE model issues targeted follow-up sub-queries. It iteratively narrows down candidate entities until one is confirmed or the stage budget runs out.

Stage 3: Answer seeking and synthesis. Conditioned on the identified entity, the EXECUTIVE model queries the MEMORY model for supporting facts. It then synthesizes all retrieved responses into a final answer.

The MEMORY model’s responses are compact natural-language snippets. Their length is independent of corpus size, so retrieval cost does not scale with the number of documents. This contrasts with RAG, where inference cost grows with the corpus.
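The three stages can be sketched as a loop in which the EXECUTIVE model decides what to ask and the MEMORY model is just a text-in, text-out function. This is an illustrative sketch under assumptions: the `Executive` callbacks and `run_protocol` are hypothetical names, the default budgets of 7 and 8 follow the stage budgets listed in the explainer further down, and how interactions are counted against each budget in the paper is not specified here.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Memory = Callable[[str], str]  # sub-query in, short natural-language snippet out


@dataclass
class Executive:
    """Hypothetical wrapper around black-box calls to the EXECUTIVE LLM."""
    decompose: Callable[[str], list[str]]                             # query -> atomic sub-questions
    pick_entity: Callable[[str, list[str]], Optional[str]]            # confirmed entity, or None
    next_probe: Callable[[str, Optional[str], list[str]], Optional[str]]  # follow-up, or None to stop
    synthesize: Callable[[str, Optional[str], list[str]], str]        # final answer


def run_protocol(query: str, executive: Executive, memory: Memory,
                 identify_budget: int = 7, seek_budget: int = 8) -> str:
    notes: list[str] = []

    def ask(sub_query: str) -> None:
        notes.append(f"{sub_query} -> {memory(sub_query)}")

    # Stage 1: grounding. One round of independent atomic sub-questions.
    for sub_query in executive.decompose(query):
        ask(sub_query)

    # Stage 2: entity identification. Narrow candidates until confirmed or out of budget.
    entity = executive.pick_entity(query, notes)
    for _ in range(identify_budget):
        if entity is not None:
            break
        probe = executive.next_probe(query, None, notes)
        if probe is None:
            break
        ask(probe)
        entity = executive.pick_entity(query, notes)

    # Stage 3: answer seeking, then synthesis over everything collected.
    for _ in range(seek_budget):
        probe = executive.next_probe(query, entity, notes)
        if probe is None:
            break
        ask(probe)
    return executive.synthesize(query, entity, notes)
```

Because the MEMORY model only ever returns short snippets, the size of `notes` is bounded by the stage budgets rather than by the size of the corpus.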

Experimental Results

MEMO is evaluated on three benchmarks: BrowseComp-Plus (multi-hop deep research), NarrativeQA (discourse understanding over books and movie scripts), and MuSiQue (2–4 hop reasoning over Wikipedia paragraphs). Baselines include BM25, NV-Embed-V2, HippoRAG2, and Cartridges. Cartridges requires white-box access to the EXECUTIVE model and scored 0.00% on BrowseComp-Plus and 3.75% on NarrativeQA.

On NarrativeQA with Gemini-3-Flash, MEMO achieves 53.58%. HippoRAG2 reaches 23.21% on the same setup. On MuSiQue, MEMO achieves 60.20% against HippoRAG2’s 57.00%. On BrowseComp-Plus, MEMO achieves 66.67% against HippoRAG2’s 66.33%.

With Qwen2.5-32B-Instruct as the EXECUTIVE model, MEMO achieves 54.22% on BrowseComp-Plus and 48.30% on MuSiQue. Switching to Gemini-3-Flash yields gains of 12.45, 26.73, and 11.90 percentage points on BrowseComp-Plus, NarrativeQA, and MuSiQue respectively. The MEMORY model is not retrained when the EXECUTIVE model changes.
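The three gains can be reproduced from the per-benchmark scores quoted in this article. The short check below uses only those numbers. The NarrativeQA score under Qwen2.5-32B-Instruct (26.85%) is the full-retraining figure reported in the model-merging section, and the dictionary layout is just an illustration.

```python
# (Qwen2.5-32B-Instruct score, Gemini-3-Flash score) per benchmark, in percent.
scores = {
    "BrowseComp-Plus": (54.22, 66.67),
    "NarrativeQA": (26.85, 53.58),
    "MuSiQue": (48.30, 60.20),
}

# Gain in percentage points from swapping the EXECUTIVE model.
gains = {name: round(gemini - qwen, 2) for name, (qwen, gemini) in scores.items()}
```

These are absolute differences in accuracy, so they are best read as percentage points rather than relative improvements.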

Robustness to retrieval noise: The research team evaluates performance when distractor documents are added to the corpus. NV-Embed-V2 and HippoRAG2 drop by up to 6.22% on BrowseComp-Plus when one negative document is added per evidence document. MEMO’s accuracy on the same benchmark changes by +0.55%, within one standard deviation.

MEMORY model architecture robustness: The research team also tests three MEMORY model families at similar parameter scale: Qwen2.5-1.5B-Instruct, Gemma3-1B-IT, and LFM2.5-1.2B-Instruct (a hybrid state-space and transformer architecture). Performance is largely consistent across all three, indicating the framework is not sensitive to the specific pretraining lineage of the MEMORY model.

Continual Knowledge Integration via Model Merging

MEMO supports incremental knowledge updates through model merging. When a new corpus arrives, a separate MEMORY model is trained on it independently. Its task vector (the parameter difference from the base model) is then merged with the existing MEMORY model in parameter space.
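To make the idea of merging task vectors concrete, here is a toy sketch of a TIES-style merge (trim, elect sign, disjoint mean) on flat lists of floats. It is an illustration under assumptions, not the authors' code: the function names are hypothetical, real models merge per-tensor rather than per-list, and treating the reported ρ=0.3 as the fraction of entries kept after trimming is an assumption, since the article does not define ρ.

```python
def task_vector(finetuned: list[float], base: list[float]) -> list[float]:
    """Parameter difference between a fine-tuned MEMORY model and its base."""
    return [f - b for f, b in zip(finetuned, base)]


def ties_merge(base: list[float], task_vectors: list[list[float]],
               density: float = 0.3, scale: float = 1.0) -> list[float]:
    """Merge several task vectors into the base weights, TIES style."""
    n = len(base)
    keep = max(1, round(density * n))
    trimmed = []
    for tv in task_vectors:
        # Trim: keep only the `keep` largest-magnitude entries of each task vector.
        top = set(sorted(range(n), key=lambda i: abs(tv[i]), reverse=True)[:keep])
        trimmed.append([tv[i] if i in top else 0.0 for i in range(n)])

    merged = list(base)
    for i in range(n):
        column = [tv[i] for tv in trimmed]
        # Elect: the sign with the larger total magnitude wins for this parameter.
        sign = 1.0 if sum(column) >= 0 else -1.0
        # Disjoint merge: average only the nonzero entries that agree with that sign.
        agree = [v for v in column if v * sign > 0]
        if agree:
            merged[i] += scale * sum(agree) / len(agree)
    return merged


base = [0.0, 0.0, 0.0, 0.0]
tv_a = task_vector([1.0, -0.1, 0.5, 0.0], base)
tv_b = task_vector([-0.4, 0.2, 0.6, 0.0], base)
merged = ties_merge(base, [tv_a, tv_b], density=0.5)
```

In the toy input, the two task vectors disagree in sign on the first parameter. The larger-magnitude update wins and the conflicting one is dropped instead of being averaged in, which is the interference-reduction idea behind TIES.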

The research team tests this on NarrativeQA using TIES merging (ρ=0.3). For K=2 corpora, merging accumulates 48 GPU-hours versus 72 GPU-hours for full retraining, a 33% reduction. Cumulative merging cost scales as Θ(K) while full retraining scales as Θ(K²); at K=10 this yields a 5.5× saving (240 vs. 1,320 GPU-hours).
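The reported GPU-hour figures are consistent with a simple cost model, sketched below. The 24 GPU-hours per corpus is inferred from the K=2 merging figure (48 / 2), and the assumption that full retraining is rerun on all corpora seen so far each time a new one arrives is ours. Neither is stated explicitly in the article.

```python
def merging_hours(k: int, hours_per_corpus: int = 24) -> int:
    """Cumulative cost when each new corpus trains its own MEMORY model once."""
    return k * hours_per_corpus


def retraining_hours(k: int, hours_per_corpus: int = 24) -> int:
    """Cumulative cost when every arrival triggers retraining on all corpora so far.

    The i-th arrival costs i * hours_per_corpus, so the total is
    hours_per_corpus * (1 + 2 + ... + k).
    """
    return hours_per_corpus * k * (k + 1) // 2


reduction_k2 = 1 - merging_hours(2) / retraining_hours(2)   # fraction of compute saved at K=2
saving_k10 = retraining_hours(10) / merging_hours(10)       # cost ratio at K=10
```

Under this model the ratio between the two costs is (K + 1) / 2, which is why the saving grows linearly with the number of corpora.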

The merged MEMORY model trails full retraining by 11.04 percentage points under Qwen2.5-32B-Instruct (15.81% vs. 26.85%). It trails by 19.11 percentage points under Gemini-3-Flash (34.47% vs. 53.58%). Despite this gap, it outperforms all retrieval baselines on NarrativeQA.

Marktechpost’s Visual Explainer

MEMO: Memory as a Model
01 / 06 — The Problem
LLMs Freeze After Pretraining
Their knowledge becomes outdated as the world evolves.

Large language models are static once pretraining ends. For applications requiring up-to-date or domain-specific knowledge, three approaches exist, and each has a critical flaw.

RAG: Sensitive to retrieval noise. Struggles when answers span multiple documents.
Fine-Tuning: Causes catastrophic forgetting. Expensive. Cannot be used on proprietary LLMs.
Latent Memory: Representations are tightly coupled to one specific model architecture.

MEMO (Memory as a Model), from researchers at NUS, MIT CSAIL, and A*STAR, addresses all three limitations simultaneously.

02 / 06 — The Concept
Memory Separated From Reasoning
Two models. One frozen. One trained on new knowledge.

MEMO introduces two distinct model roles that operate together.

MEMORY Model: A small, dedicated language model trained to internalize knowledge from a target corpus. It stores facts and cross-document relationships in its parameters. It never sees source documents at inference; it answers only from what it has learned.
EXECUTIVE Model: The main LLM, frozen and unchanged throughout. It queries the MEMORY model through targeted sub-questions, reasons over retrieved responses, and produces the final answer. Works with any LLM, including closed-source APIs.

In experiments: Qwen2.5-14B-Instruct as MEMORY model. Qwen2.5-32B-Instruct or Gemini-3-Flash as EXECUTIVE model. Only black-box API access is required: no weights, no logits.

03 / 06 — Training
How the MEMORY Model Is Built
A five-step pipeline converts raw documents into a reflection QA dataset.

01. Fact Extraction: Direct extraction of stated facts and indirect extraction of inferred information run in parallel per document chunk.
02. Consolidation: QA pairs sharing a common entity, time period, or relationship are merged into multi-fact pairs.
03. Verification & Rewriting: Each pair is checked for self-containment. Pairs with unresolved pronouns or implicit references are rewritten or discarded.
04. Entity Surfacing: QA pairs are generated where questions encode entity attributes and answers reveal identities, targeting the reversal curse.
05. Cross-Document Synthesis: The most critical step. Removing it drops NarrativeQA accuracy from 24.00% to 6.37%. Constructs QA pairs spanning multiple documents through converging clues and parallel properties.

The MEMORY model is trained via supervised fine-tuning (SFT), with the loss computed over answer tokens only. Source documents are never provided at inference.

04 / 06 — Inference
Three-Stage Query Protocol
The EXECUTIVE model queries the MEMORY model through structured sub-questions.

Complex user queries are decomposed across three sequential stages. No documents are retrieved; all answers come from internalized parametric knowledge.

S1. Grounding (budget: 1 interaction): The user query is decomposed into atomic sub-questions, each targeting one identifying constraint. The MEMORY model answers each independently.
S2. Entity Identification (budget: 7 interactions): Using grounding responses, the EXECUTIVE model issues follow-up sub-queries to iteratively narrow candidate entities until one is confirmed.
S3. Answer Seeking & Synthesis (budget: 8 interactions): Conditioned on the confirmed entity, the EXECUTIVE model gathers supporting facts, then synthesizes all retrieved responses into a final answer.

MEMORY model responses are compact natural-language snippets. Retrieval cost is fixed and does not scale with corpus size, unlike RAG.

05 / 06 — Advantages
What MEMO Does Differently
Compared to RAG, fine-tuning, and latent memory methods.

Other Methods
Retrieval noise significantly degrades RAG accuracy
Fine-tuning causes catastrophic forgetting in the LLM
Latent memory tied to one specific model architecture
Retrieval cost grows with corpus size at inference
Cannot be used with proprietary closed-source LLMs
Adding new knowledge requires full retraining

MEMO
Accuracy changes by ±1.77% under added distractor documents
Main LLM stays frozen; no catastrophic forgetting possible
Works across Qwen, Gemma, and LFM2.5 architectures
Fixed-size responses; cost independent of corpus size
Black-box compatible: works with any LLM, including APIs
New corpora merged via model merging without full retraining

TIES merging (ρ=0.3) cuts compute by 33% at K=2 corpora and by 5.5× at K=10 corpora versus full retraining.

06 / 06 — Results
Benchmark Performance
Qwen2.5-14B-Instruct as MEMORY model. Gemini-3-Flash as EXECUTIVE model.

NarrativeQA: 53.58% (vs. HippoRAG2: 23.21%)
MuSiQue: 60.20% (vs. HippoRAG2: 57.00%)
BrowseComp-Plus: 66.67% (vs. HippoRAG2: 66.33%)

Switching the EXECUTIVE model from Qwen2.5-32B-Instruct to Gemini-3-Flash yields gains of +12.45%, +26.73%, and +11.90% across the three benchmarks, without retraining the MEMORY model.

Under retrieval noise, HippoRAG2 drops 6.22% on BrowseComp-Plus. MEMO changes by +0.55% on the same benchmark, within one standard deviation.

Source: arXiv 2605.15156 — Quek, Lee, Leong, Verma et al., NUS / MIT CSAIL / A*STAR / SMART, May 2026.


Key Takeaways

  • MEMO trains a dedicated MEMORY model on new knowledge, keeping the main LLM frozen and unchanged.
  • A five-step data synthesis pipeline converts raw documents into a reflection QA dataset capturing cross-document relationships.
  • At inference, a structured multi-turn protocol decomposes complex queries into targeted sub-queries to the MEMORY model.
  • Retrieval cost is fixed at inference time: it does not scale with corpus size, unlike RAG.
  • Model merging cuts cumulative training compute by 33% at K=2 corpora and 5.5× at K=10, with a measurable accuracy trade-off.

