Baidu Releases Unlimited OCR, a 3B Model That Keeps the KV Cache Flat for Long-Document Parsing
Most end-to-end OCR fashions decelerate as output grows. Each generated token provides to the KV cache. Memory rises and technology drags. Parsing dozens of pages turns into impractical. Baidu’s Unlimited OCR addresses this straight. It swaps the decoder’s consideration for a design that retains reminiscence fixed.
TL;DR
- Unlimited OCR is a 3B-parameter Mixture-of-Experts mannequin, with solely 500M parameters energetic.
- It replaces decoder consideration with Reference Sliding Window Attention (R-SWA), preserving the KV cache fixed.
- The mannequin parses dozens of pages in a single ahead cross underneath a 32K most size.
- It scores 93.23 on OmniDocBench v1.5, beating the DeepSeek OCR baseline by 6.22 factors.
- It builds on DeepSeek OCR through continue-training, not a from-scratch run.
What is Unlimited OCR?
Unlimited OCR takes DeepSeek OCR as its baseline. It retains the DeepEncoder and the Mixture-of-Experts decoder. The MoE design holds 3B whole parameters however prompts solely 500M at inference.
The DeepEncoder is the compression engine. It cascades a SAM-ViT underneath window consideration with a CLIP-ViT underneath international consideration. At the bridge, it applies 16× token compression. A 1024×1024 PDF picture turns into simply 256 visible tokens. Fewer enter tokens imply a smaller prefill.
DeepEncoder natively helps 5 decision modes, and Unlimited OCR retains two. ‘Base’ mode runs at 1024×1024 for multi-page work. ‘Gundam’ mode makes use of dynamic decision for single pages.

How R-SWA Keeps the Cache Constant
The contribution is Reference Sliding Window Attention. Standard Multi-Head Attention shops a key and worth for each token. As output size T grows, the cache grows with it. The measurement is CMHA(T) = Lm + T. Memory and latency climb with out certain.
R-SWA breaks that hyperlink. Each generated token attends to all reference tokens, which means the visible tokens and the immediate. It additionally attends to the previous n output tokens, the place n defaults to 128. Everything older is evicted. The cache turns into a mounted queue of measurement m + n.
The measurement is CR-SWA(T) = Lm + min(n, T) ≤ Lm + n. It is bounded by a fixed. As T grows far past n, the cache ratio traits towards zero. So reminiscence stays flat and per-step latency stays flat.
The analysis staff evaluate this to delicate forgetting. An individual copying a e book glances at the supply and the previous few phrases. They don’t re-read the whole lot transcribed up to now. Visual tokens by no means bear state updates. That avoids the progressive blurring seen in linear consideration. The interactive simulator under allows you to fluctuate T and watch each caches reply.
Animate decoding</button>
<button id=”reset”>Reset</button>
</div>
<div class=”playing cards”>
<div class=”card mha”>
<div class=”ok”>MHA KV cache</div>
<div class=”v” id=”mhaVal”>10,048</div>
<div class=”u”>key/worth entries</div>
</div>
<div class=”card swa”>
<div class=”ok”>R-SWA KV cache</div>
<div class=”v” id=”swaVal”>2,176</div>
<div class=”u”>key/worth entries</div>
</div>
<div class=”card”>
<div class=”ok”>Cache ratio ρ</div>
<div class=”v” id=”ratioVal”>0.217</div>
<div class=”u”>R-SWA ÷ MHA</div>
</div>
<div class=”card”>
<div class=”ok”>Memory saved</div>
<div class=”v” id=”saveVal”>78%</div>
<div class=”u”>vs normal MHA</div>
</div>
</div>
<div class=”bars”>
<div class=”bar-label”><span>Standard MHA — grows with each token</span><b id=”mhaBarTxt”>10,048</b></div>
<div class=”bar-track”><div class=”bar-fill mha” id=”mhaBar” model=”width:100%”></div></div>
<div class=”bar-label”><span>R-SWA — bounded by L<sub>m</sub> + n</span><b id=”swaBarTxt”>2,176</b></div>
<div class=”bar-track”><div class=”bar-fill swa” id=”swaBar” model=”width:21%”></div></div>
<div class=”system” id=”fMha”>C<sub>MHA</sub>(T) = L<sub>m</sub> + T = <b id=”fMhaN”>2,048 + 8,000 = 10,048</b></div>
<div class=”system” id=”fSwa”>C<sub>R-SWA</sub>(T) = L<sub>m</sub> + min(n, T) = <b id=”fSwaN”>2,048 + 128 = 2,176</b></div>
</div>
<div class=”stream-wrap”>
<div class=”stream-title”>Attention view: every new token sees <b>all reference tokens</b> plus solely the <b>final n output tokens</b>. Earlier output is evicted from the cache.</div>
<div class=”stream” id=”stream”></div>
<div class=”legend”>
<span><i class=”dot” model=”background:#b0b0b0″></i> Reference tokens (visible + immediate, at all times seen)</span>
<span><i class=”dot” model=”background:#ffffff”></i> Active window (final n tokens)</span>
<span><i class=”dot” model=”background:#1c1c1c”></i> Evicted output (soft-forgotten)</span>
</div>
</div>
<p class=”notice”>Grounding: Unlimited OCR retains the full reference cache of measurement L<sub>m</sub> however holds solely the most up-to-date n output tokens (n defaults to 128). As output size T grows far past n, the cache ratio ρ(T) traits towards zero, so MHA’s linear progress is changed by a fixed footprint. The page-to-token estimate makes use of the DeepEncoder determine of 256 tokens per 1024×1024 web page. Numbers illustrate the cache formulation in the report, not a benchmark run.</p>
<div class=”foot”>
<span>R-SWA cache formulation from the Unlimited OCR technical report (arXiv:2606.23050)</span>
<span>Built by <b>Marktechpost</b></span>
</div>
</div>
<script>
(operate(){
var pages=doc.getElementById(‘pages’),
tokens=doc.getElementById(‘tokens’),
win=doc.getElementById(‘window’);
var TOK_PER_PAGE=256;
var anim=null;
operate fmt(n){return Math.spherical(n).toLocaleString(‘en-US’);}
operate render(){
var P=+pages.worth, T=+tokens.worth, n=+win.worth;
var Lm=P*TOK_PER_PAGE;
var mha=Lm+T;
var swa=Lm+Math.min(n,T);
var ratio=swa/mha;
var saved=Math.spherical((1-ratio)*100);
doc.getElementById(‘pagesVal’).textContent=P+(P===1?’ web page’:’ pages’);
doc.getElementById(‘tVal’).textContent=fmt(T);
doc.getElementById(‘nVal’).textContent=n;
doc.getElementById(‘mhaVal’).textContent=fmt(mha);
doc.getElementById(‘swaVal’).textContent=fmt(swa);
doc.getElementById(‘ratioVal’).textContent=ratio.toFixed(3);
doc.getElementById(‘saveVal’).textContent=saved+’%’;
var maxC=mha;
doc.getElementById(‘mhaBar’).model.width=’100%’;
doc.getElementById(‘swaBar’).model.width=Math.max(2,(swa/maxC)*100)+’%’;
doc.getElementById(‘mhaBarTxt’).textContent=fmt(mha);
doc.getElementById(‘swaBarTxt’).textContent=fmt(swa);
doc.getElementById(‘fMhaN’).textContent=fmt(Lm)+’ + ‘+fmt(T)+’ = ‘+fmt(mha);
doc.getElementById(‘fSwaN’).textContent=fmt(Lm)+’ + ‘+fmt(Math.min(n,T))+’ = ‘+fmt(swa);
drawStream(T,n);
postHeight();
}
operate drawStream(T,n){
var s=doc.getElementById(‘stream’);
s.innerHTML=”;
var REF=10, OUT=34;
for(var i=0;i<REF;i++){
var r=doc.createElement(‘div’); r.className=’tok ref’; s.appendChild(r);
}
var prog=Math.min(1, T/120000);
var generated=Math.spherical(prog*OUT);
var winCells=Math.max(1,Math.spherical((n/512)*8));
for(var j=0;j<OUT;j++){
var t=doc.createElement(‘div’); t.className=’tok’;
if(j<generated){
if(j>generated-winCells){ t.className=’tok win’; }
else { t.className=’tok evicted’; }
}
s.appendChild(t);
}
}
operate postHeight(){
strive{ mum or dad.postMessage({kind:’uocr-resize’,peak:doc.physique.offsetHeight+40},’*’); }catch(e){}
}
operate play(){
if(anim){stopAnim();return;}
doc.getElementById(‘play’).textContent=’
Pause’;
tokens.worth=256;
anim=setInterval(operate(){
var v=+tokens.worth+3000;
if(v>=120000){v=120000; render(); stopAnim(); return;}
tokens.worth=v; render();
},90);
}
operate stopAnim(){clearInterval(anim);anim=null;doc.getElementById(‘play’).textContent=’
Animate decoding’;}
pages.addEventListener(‘enter’,render);
tokens.addEventListener(‘enter’,operate(){ if(anim) stopAnim(); render();});
win.addEventListener(‘enter’,render);
doc.getElementById(‘play’).addEventListener(‘click on’,play);
doc.getElementById(‘reset’).addEventListener(‘click on’,operate(){
stopAnim(); pages.worth=8; tokens.worth=8000; win.worth=128; render();
});
window.addEventListener(‘load’,render);
window.addEventListener(‘resize’,postHeight);
render();
})();
</script>
</physique>
</html>
“>
How It Was Trained
Unlimited OCR was not educated from scratch. The analysis staff continue-trained from the DeepSeek OCR checkpoint for 4,000 steps. They froze the DeepEncoder and educated solely the decoder. Training used about 2M doc samples on 8×16 A800 GPUs. The 9:1 cut up favored single-page knowledge, with multi-page samples constructed by concatenation.
Benchmark
The analysis staff evaluates on OmniDocBench v1.5 and v1.6. The fundamental discovering/stat is 93.23 total on v1.5. That beats the DeepSeek OCR baseline by 6.22 factors. The desk under compares the three associated fashions. All three share the similar 3B-A0.5B measurement.
| Metric (v1.5) | DeepSeek-OCR | DeepSeek-OCR 2 | Unlimited-OCR |
|---|---|---|---|
| Overall ↑ | 87.01 | 89.17 | 93.23 |
| Text Edit ↓ | 0.073 | 0.049 | 0.038 |
| Formula CDM ↑ | 83.37 | 86.85 | 92.61 |
| Table TEDS ↑ | 84.97 | 85.60 | 90.93 |
| Read-order Edit ↓ | 0.086 | 0.060 | 0.045 |
On OmniDocBench v1.6, Unlimited OCR reaches 93.92 total. That is the high rating in the analysis paper’s v1.6 comparability. Gains maintain throughout textual content, system, and desk recognition.
Speed improves too. On OmniDocBench in Base mode, Unlimited OCR hits 5,580 TPS towards DeepSeek OCR’s 4,951 TPS. That is a 12.7% improve. The hole widens with longer output. At a 6,000-token output ceiling, DeepSeek OCR lags Unlimited OCR by 35%.
Where It Fits: Use Cases
The fixed cache fits workloads that page-by-page methods deal with poorly.
- Whole-book transcription: Feed 40+ pages and parse them in a single steady cross. The reported edit distance stays under 0.11 at 40+ pages, with 96.90% Distinct-35.
- Document parsing pipelines: Extract textual content, tables, formulation, and studying order in a single ahead cross.
- High-throughput batch parsing: The included
infer.pylaunches an SGLang server and sends concurrent requests over a folder or PDF. - Beyond OCR: The analysis staff name R-SWA a normal parsing consideration, relevant to ASR and translation.
Running It: Minimal Code
The Transformers path wants trust_remote_code=True and a CUDA GPU. Single-image parsing makes use of Gundam mode.
import torch
from transformers import AutoModel, AutoTokenizer
title = "baidu/Unlimited-OCR"
tokenizer = AutoTokenizer.from_pretrained(title, trust_remote_code=True)
mannequin = AutoModel.from_pretrained(
title, trust_remote_code=True, use_safetensors=True,
torch_dtype=torch.bfloat16,
).eval().cuda()
mannequin.infer(
tokenizer,
immediate="<picture>doc parsing.",
image_file="your_image.jpg",
output_path="your/output/dir",
base_size=1024, image_size=640, crop_mode=True, # gundam mode
max_length=32768,
no_repeat_ngram_size=35, ngram_window=128,
save_results=True,
)
Multi-page and PDF parsing name mannequin.infer_multi in Base mode at image_size=1024. For manufacturing throughput, SGLang serves an OpenAI-compatible API utilizing the fa3 consideration backend.
Strengths and Weaknesses
Strengths:
- Constant KV cache holds reminiscence and latency flat throughout lengthy outputs.
- End-to-end SOTA scores on OmniDocBench v1.5 and v1.6.
- Only 500M energetic parameters maintain inference low-cost.
- MIT license, open weights, and twin Transformers plus SGLang help.
- R-SWA positive aspects arrive with out a measured accuracy value on single pages.
Weaknesses:
- Parsing will not be really limitless; a 32K context nonetheless bounds the prefill.
- Long prefills develop as web page depend accumulates, regardless of heavy compression.
- Multi-page runs use Base mode solely, so very small textual content might be missed.
- ASR and translation switch stays future work, not a shipped end result.
Check out the Paper, Repo and Model Weights. Also, be at liberty to comply with us on Twitter and don’t neglect to affix our 150k+ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.
Need to accomplice with us for selling your GitHub Repo OR Hugging Face Page OR Product Release OR Webinar and so forth.? Connect with us
The publish Baidu Releases Unlimited OCR, a 3B Model That Keeps the KV Cache Flat for Long-Document Parsing appeared first on MarkTechPost.
