Meta Superintelligence Labs Introduces REFRAG: Scaling RAG with 16× Longer Contexts and 31× Faster Decoding

A team of researchers from Meta Superintelligence Labs, the National University of Singapore, and Rice University has unveiled REFRAG (REpresentation For RAG), a decoding framework that rethinks retrieval-augmented generation (RAG) efficiency. REFRAG extends LLM context windows by 16× and achieves up to a 30.85× acceleration in time-to-first-token (TTFT) without compromising accuracy.
Why is long context such a bottleneck for LLMs?
The attention mechanism in large language models scales quadratically with input length: if a document is twice as long, the compute and memory cost can grow fourfold. This not only slows inference but also inflates the key-value (KV) cache, making large-context applications impractical in production systems. In RAG settings, most retrieved passages contribute little to the final answer, yet the model still pays the full quadratic price to process them.
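A back-of-the-envelope calculation (our illustration, with assumed model dimensions, not figures from the paper) makes the quadratic blow-up concrete:

```python
# Rough illustration: the attention score matrix alone costs about
# n^2 * d multiply-adds per layer, so doubling the input length
# quadruples that term. Dimensions below are assumed for illustration.
def attention_score_flops(n_tokens: int, d_model: int) -> int:
    """Approximate multiply-adds for the QK^T score matrix of one layer."""
    return n_tokens * n_tokens * d_model

base = attention_score_flops(4_096, 4_096)     # assumed 4k context, d = 4096
doubled = attention_score_flops(8_192, 4_096)  # same model, doubled input
print(doubled / base)  # 4.0 -> 2x the tokens, 4x the attention compute
```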
How does REFRAG compress and shorten context?
REFRAG introduces a lightweight encoder that splits retrieved passages into fixed-size chunks (e.g., 16 tokens) and compresses each into a dense chunk embedding. Instead of consuming thousands of raw tokens, the decoder processes this much shorter sequence of embeddings. The result is a 16× reduction in sequence length, with no change to the LLM architecture.
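A minimal sketch of the idea, under stated assumptions: the mean-pooling "encoder" and linear projection below are simple stand-ins for the paper's lightweight encoder, and the dimensions are invented for illustration; this is not the released implementation.

```python
import torch
import torch.nn as nn

CHUNK_SIZE = 16            # tokens per chunk, as described in the paper
D_TOKEN, D_DEC = 768, 4096  # assumed encoder and decoder widths

# Placeholder projection mapping chunk embeddings into the decoder's space.
project = nn.Linear(D_TOKEN, D_DEC)

def compress_context(token_embs: torch.Tensor) -> torch.Tensor:
    """(num_tokens, D_TOKEN) retrieved context -> one embedding per chunk."""
    chunks = token_embs.split(CHUNK_SIZE)                  # (16, D_TOKEN) each
    pooled = torch.stack([c.mean(dim=0) for c in chunks])  # stand-in encoder
    return project(pooled)                                 # (num_chunks, D_DEC)

ctx = torch.randn(2048, D_TOKEN)    # 2,048 retrieved tokens
print(compress_context(ctx).shape)  # torch.Size([128, 4096]) -> 16x shorter
```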

How is acceleration achieved?
By shortening the decoder's input sequence, REFRAG reduces the quadratic attention computation and shrinks the KV cache. Empirical results show 16.53× TTFT acceleration at k=16 and 30.85× at k=32, far surpassing the prior state of the art, CEPE, which achieved only 2–8×. Throughput also improves by up to 6.78× compared with LLaMA baselines.
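To see where the savings come from, here is some illustrative arithmetic (ours, with assumed LLaMA-like dimensions, not the paper's benchmark code): a compression factor of k cuts the prefill attention term by roughly k² and the KV cache by roughly k.

```python
# KV cache size: K and V vectors for every token in every layer.
# Layer/head counts and fp16 storage below are assumptions.
def kv_cache_bytes(n_tokens: int, n_layers: int = 32, n_heads: int = 32,
                   head_dim: int = 128, dtype_bytes: int = 2) -> int:
    return n_tokens * n_layers * 2 * n_heads * head_dim * dtype_bytes

full = kv_cache_bytes(16_384)              # raw retrieved context
compressed = kv_cache_bytes(16_384 // 16)  # k = 16 chunk compression
print(full / compressed)                   # 16.0 -> 16x smaller KV cache
```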
How does REFRAG protect accuracy?
A reinforcement learning (RL) policy supervises compression: it identifies the most information-dense chunks and lets them bypass compression, feeding their raw tokens directly into the decoder. This selective strategy ensures that critical details, such as exact numbers or rare entities, are not lost. Across multiple benchmarks, REFRAG maintained or improved perplexity relative to CEPE while running at far lower latency.
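A minimal sketch of the selective-expansion step, assuming a precomputed per-chunk score in place of the paper's learned RL policy (the function, threshold, and shapes are hypothetical):

```python
import torch

def mix_inputs(chunk_embs, raw_chunk_tokens, scores, expand_frac=0.25):
    """Keep the highest-scoring chunks as raw tokens; compress the rest.

    chunk_embs:       (num_chunks, d) compressed chunk embeddings
    raw_chunk_tokens: list of (16, d) raw-token embeddings per chunk
    scores:           (num_chunks,) policy scores (higher = more critical)
    """
    k = max(1, int(expand_frac * len(raw_chunk_tokens)))
    expand = set(torch.topk(scores, k).indices.tolist())
    parts = [raw_chunk_tokens[i] if i in expand else chunk_embs[i : i + 1]
             for i in range(len(raw_chunk_tokens))]
    return torch.cat(parts)  # mostly embeddings, a few raw-token chunks

chunks = [torch.randn(16, 4096) for _ in range(128)]
embs, scores = torch.randn(128, 4096), torch.randn(128)
print(mix_inputs(embs, chunks, scores).shape)
# torch.Size([608, 4096]) -> 32*16 raw + 96 compressed, instead of 2,048 rows
```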
What do the experiments reveal?
REFRAG was pretrained on 20B tokens from the SlimPajama corpus (Books + arXiv) and evaluated on long-context datasets including Book, Arxiv, PG19, and ProofPile. On RAG benchmarks, multi-turn conversation tasks, and long-document summarization, REFRAG consistently outperformed strong baselines:
- 16× context extension beyond standard LLaMA-2 (4k tokens).
- ~9.3% perplexity improvement over CEPE across four datasets.
- Better accuracy in weak-retriever settings, where irrelevant passages dominate, thanks to the ability to process more passages within the same latency budget.

Summary
REFRAG shows that long-context LLMs don't have to be slow or memory-hungry. By compressing retrieved passages into compact embeddings, selectively expanding only the important ones, and rethinking how RAG decoding works, Meta Superintelligence Labs has made it possible to process much larger inputs while running dramatically faster. This makes large-context applications, such as analyzing entire reports, handling multi-turn conversations, or scaling enterprise RAG systems, not only feasible but efficient, without compromising accuracy.
FAQs
Q1. What is REFRAG?
REFRAG (REpresentation For RAG) is a decoding framework from Meta Superintelligence Labs that compresses retrieved passages into embeddings, enabling faster, longer-context inference in LLMs.
Q2. How much faster is REFRAG compared with existing methods?
REFRAG delivers up to 30.85× faster time-to-first-token (TTFT) and up to 6.78× higher throughput compared with LLaMA baselines, while outperforming CEPE.
Q3. Does compression reduce accuracy?
No. A reinforcement learning policy ensures that critical chunks remain uncompressed, preserving key details. Across benchmarks, REFRAG maintained or improved accuracy relative to prior methods.
Q4. Where will the code be available?
Meta Superintelligence Labs will release REFRAG on GitHub at facebookresearch/refrag.