
Baidu’s PaddlePaddle Team Releases PaddleOCR-VL (0.9B): a NaViT-style + ERNIE-4.5-0.3B VLM Targeting End-to-End Multilingual Document Parsing

How do you convert complex, multilingual documents (dense layouts, small scripts, formulas, charts, and handwriting) into faithful structured Markdown/JSON with state-of-the-art accuracy while keeping inference latency and memory low enough for real deployments?

Baidu’s PaddlePaddle team has released PaddleOCR-VL, a 0.9B-parameter vision-language model designed for end-to-end document parsing across text, tables, formulas, charts, and handwriting. The core model combines a NaViT-style (Native-resolution ViT) dynamic-resolution vision encoder with the ERNIE-4.5-0.3B decoder. It supports 109 languages.

https://ernie.baidu.com/weblog/publication/PaddleOCR-VL_Technical_Report.pdf

Understanding the system design

PaddleOCR-VL is deployed as a two-stage pipeline. Stage one (PP-DocLayoutV2) performs page-level layout analysis: an RT-DETR detector localizes and classifies regions, and a pointer network predicts reading order. Stage two (PaddleOCR-VL-0.9B) conducts element-level recognition conditioned on the detected layout. Final outputs are aggregated into Markdown and JSON for downstream consumption. This decoupling mitigates the long-sequence decoding latency and instability that end-to-end VLMs face on dense, multi-column, mixed text–graphic pages.
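The control flow of such a two-stage design can be sketched in a few lines of Python. This is only an illustration of the decoupling described above, not the PaddleOCR-VL API: `Region`, `detect_layout`, `recognize`, and the canned outputs are all hypothetical stand-ins for the layout model and the recognition model.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Region:
    label: str         # hypothetical element type: "title", "text", "formula", ...
    bbox: tuple        # (x0, y0, x1, y1) in page pixels
    order: int         # reading-order index predicted in stage one
    content: str = ""  # filled in by stage two


def detect_layout(page):
    """Stage-one stand-in: returns labeled regions with a reading order.
    A real system would run a detector and a reading-order model here."""
    return [
        Region("text", (40, 120, 560, 300), order=1),
        Region("title", (40, 40, 560, 90), order=0),
        Region("formula", (40, 320, 560, 380), order=2),
    ]


def recognize(page, region):
    """Stage-two stand-in: recognizes a single element.
    A real system would crop `page` to region.bbox and call the VLM."""
    canned = {"title": "Quarterly Report", "text": "Revenue grew.", "formula": "E = mc^2"}
    return canned[region.label]


def parse_page(page):
    # Sort by predicted reading order, then recognize each element on its own,
    # so no single decode has to cover the whole page.
    regions = sorted(detect_layout(page), key=lambda r: r.order)
    for region in regions:
        region.content = recognize(page, region)
    return regions


def to_markdown(regions):
    lines = []
    for r in regions:
        if r.label == "title":
            lines.append("# " + r.content)
        elif r.label == "formula":
            lines.append("$$" + r.content + "$$")
        else:
            lines.append(r.content)
    return "\n\n".join(lines)


def to_json(regions):
    return json.dumps([asdict(r) for r in regions])


if __name__ == "__main__":
    parsed = parse_page(page=None)  # the stubs ignore the page image
    print(to_markdown(parsed))
    print(to_json(parsed))
```

Because each element is decoded separately, the output length per call stays short, which is the latency and stability argument the two-stage design rests on.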

At the model level, PaddleOCR-VL-0.9B integrates a NaViT-style dynamic high-resolution encoder (native-resolution sequence packing) with a 2-layer MLP projector and the ERNIE-4.5-0.3B language model; 3D-RoPE is used for positional representation. The technical report attributes fewer hallucinations and better performance on text-dense pages to native-resolution processing relative to fixed-resize or tiling approaches. The NaViT idea, patch-and-pack variable-resolution inputs without destructive resizing, originates from prior work showing improved efficiency and robustness; PaddleOCR-VL adopts this encoder style directly.
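To make the patch-and-pack idea concrete, here is a minimal sketch of how images of different sizes can be turned into patch counts at their native resolution and then packed into fixed-length token sequences. The patch size of 14, the 1024-token budget, and the first-fit strategy are assumptions chosen for illustration; the report's actual values and packing policy may differ.

```python
import math

PATCH = 14         # assumed patch size in pixels (illustrative)
MAX_TOKENS = 1024  # assumed per-sequence token budget (illustrative)


def num_patches(width, height, patch=PATCH):
    """Token count for an image kept at its native resolution."""
    return math.ceil(width / patch) * math.ceil(height / patch)


def pack(images, max_tokens=MAX_TOKENS):
    """First-fit packing of (name, width, height) images into sequences.

    Returns a list of sequences, each a list of (name, token_count) pairs.
    """
    sequences = []  # packed items per sequence
    used = []       # tokens already used per sequence
    for name, width, height in images:
        n = num_patches(width, height)
        if n > max_tokens:
            raise ValueError(f"{name} needs {n} tokens, budget is {max_tokens}")
        for i, tokens_used in enumerate(used):
            if tokens_used + n <= max_tokens:
                sequences[i].append((name, n))
                used[i] += n
                break
        else:
            # No existing sequence has room, so open a new one.
            sequences.append([(name, n)])
            used.append(n)
    return sequences


if __name__ == "__main__":
    crops = [("a", 280, 140), ("big", 448, 448), ("b", 420, 280)]
    for seq in pack(crops):
        print(seq)
```

No image is resized to a common shape: a small crop simply contributes fewer tokens. In a packed encoder, an attention mask would additionally keep patches from different images from attending to one another; that part is omitted here.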

Benchmarks

PaddleOCR-VL achieves state-of-the-art results on OmniDocBench v1.5 and competitive or leading scores on v1.0, covering overall quality as well as sub-tasks (text edit distance, Formula-CDM, Table-TEDS/TEDS-S, and reading-order edit distance), with complementary strength on olmOCR-Bench and in-house handwriting, table, formula, and chart evaluations.
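For readers unfamiliar with edit-distance metrics, the sketch below computes a Levenshtein distance and normalizes it by the longer sequence, so that 0 means an exact match and 1 means nothing in common. It works on strings (text) and on lists of region indices (reading order). The normalization shown is an assumption for illustration; the benchmarks' exact preprocessing and normalization may differ.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        curr = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(pred, ref):
    """Edit distance divided by the longer length; lower is better."""
    if not pred and not ref:
        return 0.0
    return levenshtein(pred, ref) / max(len(pred), len(ref))


if __name__ == "__main__":
    # Text recognition: compare predicted and reference strings.
    print(normalized_edit_distance("PaddleOCR-VL", "PaddleOCR-V1"))
    # Reading order: compare predicted and reference region sequences.
    print(normalized_edit_distance([0, 2, 1, 3], [0, 1, 2, 3]))
```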


Key Takeaways

  • 0.9B-parameter PaddleOCR-VL integrates a NaViT-style dynamic-resolution encoder with ERNIE-4.5-0.3B for document parsing.
  • Targets end-to-end extraction across text, tables, formulas, charts, and handwriting with structured Markdown/JSON outputs.
  • Claims SOTA performance on public document benchmarks with fast inference suitable for deployment.
  • Supports 109 languages, including small scripts and complex page layouts.

Editorial Comments

This release is significant because it joins a NaViT-style dynamic-resolution visual encoder with the lightweight ERNIE-4.5-0.3B decoder to deliver SOTA page-level document parsing and element-level recognition at practical inference cost. The two-stage PP-DocLayoutV2 → PaddleOCR-VL-0.9B design stabilizes reading order and preserves native typography cues, which matter for small scripts, formulas, charts, and handwriting across 109 languages. Structured Markdown/JSON outputs and optional vLLM/SGLang acceleration make the system operationally clean for production document intelligence.



The post Baidu’s PaddlePaddle Team Releases PaddleOCR-VL (0.9B): a NaViT-style + ERNIE-4.5-0.3B VLM Targeting End-to-End Multilingual Document Parsing appeared first on MarkTechPost.
