Zhipu AI Introduces GLM-OCR: A 0.9B Multimodal OCR Model for Document Parsing and Key Information Extraction (KIE)
Why Document OCR Remains a Hard Engineering Problem
What does it take to make OCR useful for real paperwork instead of clean demo images? And can a compact multimodal model handle parsing, tables, formulas, and structured extraction without turning inference into a resource bonfire?
This is the problem targeted by GLM-OCR, released by researchers from Zhipu AI and Tsinghua University. The research team presents GLM-OCR as a 0.9B-parameter compact multimodal model for document understanding. It combines a 0.4B CogViT visual encoder, a lightweight cross-modal connector, and a 0.5B GLM language decoder. The stated goal is to balance document recognition quality with lower latency and lower computational cost than larger multimodal systems.
Traditional OCR systems are often good at plain text transcription, but they struggle when documents contain mixed layouts, tables, formulas, code blocks, seals, and structured fields. Recent multimodal large language models improve document understanding, but the research team argues that their size and standard autoregressive decoding make them expensive for edge deployment and large-scale production. GLM-OCR is positioned as a smaller system built for these deployment constraints rather than as a general-purpose vision-language model adapted to OCR as an afterthought.
A Compact Architecture Built for OCR Workloads
A key technical point of this work is the use of Multi-Token Prediction (MTP). Standard autoregressive decoding predicts one token at a time, which is not ideal for OCR-style tasks where outputs are often deterministic and locally structured. GLM-OCR instead predicts multiple tokens per step. The model is trained to predict 10 tokens per step and generates 5.2 tokens per decoding step on average at inference time, yielding about a 50% throughput improvement. To keep memory overhead manageable, the implementation uses a parameter-sharing scheme across the draft models.
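To make the decoding mechanics concrete, here is a minimal toy sketch of an MTP-style draft-and-verify loop. The real model trains parameter-shared draft heads inside the network; the `draft` and `verify` functions below are stand-ins over a fixed target string, purely to illustrate why accepting several tokens per step reduces the number of decoding steps.

```python
# Toy sketch of Multi-Token Prediction (MTP) decoding.
# draft() and verify() are hypothetical stand-ins, not GLM-OCR APIs.

TARGET = list("total: 42.00 USD")  # pretend ground-truth OCR output
K = 10                             # tokens drafted per step (as in the paper)

def draft(prefix, k=K):
    """Propose up to k next tokens (a perfect oracle in this toy)."""
    start = len(prefix)
    return TARGET[start:start + k]

def verify(prefix, proposal):
    """Accept the longest agreed-upon prefix of the proposal.
    Here we cap acceptance at 5 tokens to mimic the ~5.2 average
    accepted tokens per step reported for GLM-OCR."""
    return proposal[:5]

def mtp_decode():
    out, steps = [], 0
    while len(out) < len(TARGET):
        accepted = verify(out, draft(out))
        if not accepted:                      # fallback: one token per step
            accepted = TARGET[len(out):len(out) + 1]
        out.extend(accepted)
        steps += 1
    return "".join(out), steps

text, steps = mtp_decode()
print(text, steps)  # 16 tokens emitted in 4 steps instead of 16
```

With 5 tokens accepted per step, 16 output tokens need only 4 decoding steps instead of 16, which is where the throughput gain comes from.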
Two-Stage Layout Parsing Instead of Flat Page Reading
At the system level, GLM-OCR adopts a two-stage pipeline. The first stage uses PP-DocLayout-V3 for layout analysis, which detects structured regions on the page. The second stage performs parallel region-level recognition over these detected regions. This matters because the model is not simply reading an entire page left-to-right as a generic vision-language model might. It first breaks the page down into semantically meaningful regions, which improves efficiency and makes the system more robust on documents with challenging layouts.
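The control flow of that pipeline can be sketched as follows. `detect_layout` and `recognize_region` are placeholders for PP-DocLayout-V3 and the 0.9B recognition model (stubbed here with canned data); the point is that detected regions are independent, so recognition can run in parallel before reassembly in reading order.

```python
# Hedged sketch of the two-stage pipeline: layout analysis, then
# parallel region-level recognition. Both model calls are stubs.
from concurrent.futures import ThreadPoolExecutor

def detect_layout(page_image):
    # Stage 1: would call PP-DocLayout-V3; returns regions in reading order.
    return [
        {"type": "title", "bbox": (0, 0, 600, 40)},
        {"type": "table", "bbox": (0, 60, 600, 300)},
        {"type": "text",  "bbox": (0, 320, 600, 500)},
    ]

def recognize_region(page_image, region):
    # Stage 2: would crop the bbox and run the 0.9B model on the region.
    return f"<{region['type']} content>"

def parse_page(page_image):
    regions = detect_layout(page_image)
    with ThreadPoolExecutor() as pool:   # regions are independent
        texts = list(pool.map(lambda r: recognize_region(page_image, r),
                              regions))
    return "\n\n".join(texts)            # reassemble in reading order

print(parse_page(page_image=None))
```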
Document Parsing and KIE Use Different Output Paths
The architecture also separates two related document tasks. For document parsing, the pipeline uses layout detection and region processing to produce structured outputs such as Markdown and JSON. For Key Information Extraction (KIE), the research team describes a different path: the full document image is fed to the model with a task prompt, and the model directly generates JSON containing the extracted fields. That distinction matters because GLM-OCR is not presented as a single monolithic page-to-text model. It is a structured generation system with different operating modes depending on the task.
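A caller-side sketch of the two output paths, under stated assumptions: `run_model` stands in for GLM-OCR inference, and the prompt strings are illustrative, not the model's documented prompt format.

```python
import json

# Illustrative sketch of the two operating modes. run_model() is a
# stub for GLM-OCR inference; the prompts are assumptions.
def run_model(image, prompt):
    if "extract" in prompt:   # canned responses for the sketch
        return '{"invoice_no": "INV-001", "total": "42.00"}'
    return "# Title\n\n| a | b |\n|---|---|"

def parse_document(image):
    # Parsing path: layout-driven, produces structured Markdown/JSON.
    return run_model(image, "Parse this document to Markdown.")

def extract_fields(image, fields):
    # KIE path: full page image + task prompt -> JSON directly.
    prompt = f"extract the fields {fields} as JSON"
    return json.loads(run_model(image, prompt))

print(extract_fields(None, ["invoice_no", "total"]))
```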
A Four-Stage Training Pipeline with Task-Specific Rewards
The training recipe is split into four stages. Stage 1 trains the vision encoder on image-text pairs and grounding or retrieval data. Stage 2.1 performs multimodal pretraining on image-text, document parsing, grounding, and VQA data. Stage 2.2 adds the MTP objective. Stage 3 is supervised fine-tuning on OCR-specific tasks including text recognition, formula transcription, table structure recovery, and KIE. Stage 4 applies reinforcement learning using GRPO. The reward design is task-specific: Normalized Edit Distance for text recognition, CDM score for formula recognition, TEDS score for table recognition, and field-level F1 for KIE, together with structural penalties such as repetition penalties, malformed-structure penalties, and JSON validation constraints.
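Two of those rewards are simple enough to sketch directly. The snippet below shows a Normalized Edit Distance reward for text recognition and a field-level F1 reward for KIE with a JSON-validity constraint; the exact weighting and penalty forms used in the actual Stage-4 GRPO setup are not public, so treat this as an illustration of the shape of the reward, not the paper's implementation.

```python
import json

def levenshtein(a, b):
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[-1] + 1,            # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def text_reward(pred, ref):
    # 1 - Normalized Edit Distance (NED) for text recognition.
    return 1.0 - levenshtein(pred, ref) / max(len(pred), len(ref), 1)

def kie_reward(pred_json, ref_fields):
    # Field-level F1 for KIE, gated by a JSON validation constraint.
    try:
        pred = json.loads(pred_json)
    except json.JSONDecodeError:
        return 0.0                     # malformed-structure penalty
    hits = sum(pred.get(k) == v for k, v in ref_fields.items())
    prec = hits / max(len(pred), 1)
    rec = hits / max(len(ref_fields), 1)
    return 2 * prec * rec / max(prec + rec, 1e-9)

print(text_reward("total 42", "total 42"))          # 1.0
print(kie_reward('{"total": "42"}', {"total": "42"}))  # 1.0
```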
Benchmark Results Show Strong Performance, With Important Caveats
On public benchmarks, GLM-OCR reports strong results across multiple document tasks. It scores 94.6 on OmniDocBench v1.5, 94.0 on OCRBench (Text), 96.5 on UniMERNet, 85.2 on PubTabNet, and 86.0 on TEDS_TEST. For KIE, it reports 93.7 on Nanonets-KIE and 86.1 on Handwritten-KIE. The research team notes that results for Gemini-3-Pro and GPT-5.2-2025-12-11 are shown only for reference and are excluded from the best-score ranking, which is an important detail when interpreting claims about model leadership.

The benchmark story is strong, but it needs careful phrasing. GLM-OCR achieves the highest reported scores among the evaluated non-reference models on OmniDocBench v1.5, OCRBench (Text), UniMERNet, and TEDS_TEST. On PubTabNet, however, it does not lead overall; MinerU 2.5 reports 88.4 versus GLM-OCR's 85.2. For KIE, GLM-OCR outperforms the listed open-source competitors, but Gemini-3-Pro scores higher on both Nanonets-KIE and Handwritten-KIE in the reference column. So the results support a strong competitive claim, but not a blanket 'best at everything' claim.
Deployment Details
The research team states that GLM-OCR supports vLLM, SGLang, and Ollama, and can be fine-tuned via LLaMA-Factory. They also report throughput of 0.67 images/s and 1.86 PDF pages/s under their evaluation setup. In addition, they describe a MaaS API priced at 0.2 RMB per million tokens, with example cost estimates for scanned images and simple-layout PDFs. These details suggest that GLM-OCR is being framed as both a research model and a deployable system.
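For planning purposes, the reported pricing makes back-of-envelope cost estimates easy. The tokens-per-page figure below is an assumption for illustration, not a number from the release:

```python
# Cost estimator from the reported MaaS price of 0.2 RMB per million
# tokens. tokens_per_page is an assumed value, not an official figure.
PRICE_RMB_PER_M_TOKENS = 0.2

def cost_rmb(pages, tokens_per_page=800):
    tokens = pages * tokens_per_page
    return tokens / 1_000_000 * PRICE_RMB_PER_M_TOKENS

# e.g. 10,000 simple-layout pages at ~800 output tokens each
print(round(cost_rmb(10_000), 2))  # 1.6 RMB
```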
Key Takeaways
- GLM-OCR is a compact 0.9B multimodal OCR model built from a 0.4B CogViT encoder and a 0.5B GLM decoder.
- It uses Multi-Token Prediction (MTP) to improve decoding efficiency, reaching 5.2 tokens per step on average and about 50% higher throughput.
- The model uses a two-stage pipeline: PP-DocLayout-V3 handles layout analysis, then GLM-OCR performs parallel region-level recognition.
- It supports both document parsing and KIE: parsing outputs Markdown/JSON, while KIE directly generates JSON from the full document image.
- Benchmark results are strong but not universal wins: GLM-OCR leads several reported non-reference benchmarks, but MinerU 2.5 is higher on PubTabNet, and Gemini-3-Pro is higher on the reference-only KIE scores.
Check out the Paper, Repo and Model Page.
The post Zhipu AI Introduces GLM-OCR: A 0.9B Multimodal OCR Model for Document Parsing and Key Information Extraction (KIE) appeared first on MarkTechPost.
