
Tencent Hunyuan Releases HunyuanOCR: a 1B Parameter End to End OCR Expert VLM

Tencent Hunyuan has launched HunyuanOCR, a 1B parameter vision language model specialized for OCR and document understanding. The model is built on Hunyuan's native multimodal architecture and runs text spotting, parsing, information extraction, visual question answering, and text image translation through a single end-to-end pipeline.

HunyuanOCR is a lightweight alternative to general VLMs such as Gemini 2.5 and Qwen3 VL that still matches or surpasses them on OCR-centric tasks. It targets production use cases like document parsing, card and receipt extraction, video subtitle extraction, and multilingual document translation.

https://github.com/Tencent-Hunyuan/HunyuanOCR/blob/main/HunyuanOCR_Technical_Report.pdf

Architecture: Native Resolution ViT plus Lightweight LLM

HunyuanOCR uses three main modules: a native resolution visual encoder called Hunyuan ViT, an adaptive MLP connector, and a lightweight language model. The encoder is based on SigLIP-v2-400M and is extended to support arbitrary input resolutions through adaptive patching that preserves the original aspect ratio. Images are split into patches according to their native proportions and processed with global attention, which improves recognition on long text lines, long documents, and low quality scans.
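
To make the idea concrete, here is a minimal sketch of aspect-ratio-preserving patching. The patch size, token budget, and rounding rules below are illustrative assumptions, not values from the report:

```python
import math

def native_resolution_grid(height, width, patch=16, max_patches=4096):
    """Compute a patch grid for an image without resizing it to a fixed square.

    If the grid exceeds the token budget, both dimensions are scaled by the
    same factor, so the grid keeps the image's original aspect ratio.
    """
    rows, cols = math.ceil(height / patch), math.ceil(width / patch)
    total = rows * cols
    if total > max_patches:
        scale = math.sqrt(max_patches / total)  # uniform scaling preserves aspect ratio
        rows = max(1, math.floor(rows * scale))
        cols = max(1, math.floor(cols * scale))
    return rows, cols
```

A portrait document keeps a tall grid rather than being squashed into a square, which is what helps on long text lines and long pages.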

The adaptive MLP connector performs learnable pooling along the spatial dimension. It compresses the dense visual tokens into a shorter sequence while keeping information from text-dense regions. This reduces the sequence length passed to the language model and lowers compute, while preserving OCR-relevant details.
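
The compression step can be illustrated with plain mean pooling over spatial blocks; the real connector learns its pooling weights, and the 2x2 factor here is an assumption for illustration:

```python
def pool_tokens(tokens, rows, cols, k=2):
    """Pool a rows*cols grid of visual token vectors by a factor of k per axis.

    Stand-in for the learnable pooling in the adaptive MLP connector: each
    k*k block of tokens is collapsed into one averaged vector, shortening
    the sequence handed to the language model by roughly k*k.
    """
    out = []
    for r in range(0, rows - rows % k, k):
        for c in range(0, cols - cols % k, k):
            block = [tokens[(r + i) * cols + (c + j)]
                     for i in range(k) for j in range(k)]
            dim = len(block[0])
            out.append([sum(v[d] for v in block) / len(block)
                        for d in range(dim)])
    return out
```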

The language model is based on the dense Hunyuan 0.5B model and uses XD-RoPE. XD-RoPE splits the rotary position embeddings into four subspaces for text, height, width, and time. This gives the model a native way to align 1D token order with 2D layout and 3D spatiotemporal structure. As a result, the same stack can handle multi-column pages, cross-page flows, and sequences of video frames.
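
The subspace idea can be sketched as follows. Equal subspace sizes and the standard RoPE base are illustrative assumptions; the report defines the actual partition:

```python
def xd_rope_angles(dim, text_pos, h, w, t, base=10000.0):
    """Sketch of XD-RoPE's core idea: split the rotary dimension into four
    subspaces, each driven by a different position component (1D text order,
    2D height and width, temporal frame index).

    Returns one rotation angle per rotary pair; a real implementation would
    apply cos/sin rotations to query and key vectors with these angles.
    """
    assert dim % 8 == 0  # 4 subspaces, each made of (cos, sin) rotation pairs
    sub = dim // 4
    positions = [text_pos, h, w, t]
    angles = []
    for pos in positions:
        for i in range(sub // 2):
            freq = base ** (-2 * i / sub)  # standard RoPE frequency schedule
            angles.append(pos * freq)
    return angles
```

For a still document image the time component stays at zero, so the temporal subspace contributes nothing, while the same code path handles video frames by incrementing `t`.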

Training and inference follow a fully end-to-end paradigm. There is no external layout analysis or post-processing model in the loop. All tasks are expressed as natural language prompts and handled in a single forward pass. This design removes error propagation across pipeline stages and simplifies deployment.
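
In practice that means every task is just a different instruction to the same model. The prompt strings below are hypothetical placeholders; the actual instruction templates live in the HunyuanOCR repository and may differ:

```python
# Hypothetical task prompts, for illustration only.
TASK_PROMPTS = {
    "spotting": "Detect all text in the image and return each line with its box.",
    "parsing": "Parse the document into markdown, keeping tables and formulas.",
    "extraction": "Extract the requested fields and answer as JSON.",
    "vqa": "Answer the question about the image.",
    "translation": "Translate all text in the image into {target_language}.",
}

def build_prompt(task, **kwargs):
    """One model, one forward pass; only the instruction changes per task."""
    return TASK_PROMPTS[task].format(**kwargs)
```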

Data and Pre-Training Recipe

The data pipeline builds more than 200M image-text pairs across nine real-world scenarios, including street views, documents, advertisements, handwritten text, screenshots, cards, certificates, and invoices, game interfaces, video frames, and artistic typography. The corpus covers more than 130 languages.

Synthetic data comes from a multilingual generator that supports right-to-left scripts and paragraph-level rendering. The pipeline controls font, language, rotation, and RGB values, and applies warping, blur, and local lighting changes to simulate mobile captures and other hard conditions.


Pre-training follows four stages. Stage 1 performs vision-language alignment with pure text, synthetic parsing and recognition data, and general caption data, using 50B tokens and 8k context. Stage 2 runs multimodal pre-training on 300B tokens that mix pure text with synthetic spotting, parsing, translation, and VQA samples. Stage 3 extends the context length to 32k with 80B tokens focused on long documents and long text. Stage 4 is application-oriented supervised fine-tuning on 24B tokens of human-annotated and hard-negative data, keeping the 32k context and unified instruction templates.
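
The four-stage schedule above, expressed as data for a quick sanity check (token budgets are the per-stage totals reported, in billions):

```python
# Per-stage token budgets, in billions, from the four-stage recipe.
STAGES = [
    ("vision-language alignment", 50),
    ("multimodal pre-training", 300),
    ("long-context pre-training", 80),
    ("application-oriented SFT", 24),
]

total_tokens_b = sum(tokens for _, tokens in STAGES)  # 454B tokens overall
```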

Reinforcement Learning with Verifiable Rewards

After supervised training, HunyuanOCR is further optimized with reinforcement learning. The research team uses Group Relative Policy Optimization (GRPO) in a Reinforcement Learning with Verifiable Rewards setup for structured tasks. For text spotting, the reward is based on intersection-over-union matching of boxes combined with normalized edit distance over the text. For document parsing, the reward uses the normalized edit distance between the generated structure and the reference.
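
A minimal sketch of such a verifiable spotting reward is below. The IoU threshold, the denominator of the normalized edit distance, and the way the two signals are combined are illustrative assumptions, not the exact formula from the report:

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def edit_distance(s, t):
    """Levenshtein distance via the classic rolling-row DP."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]

def spotting_reward(pred_box, pred_text, gt_box, gt_text, iou_thresh=0.5):
    """Reward a prediction only if its box matches the reference above an
    IoU threshold, then score it by text similarity (1 - normalized edit
    distance). Entirely verifiable: no learned judge in the loop."""
    if iou(pred_box, gt_box) < iou_thresh:
        return 0.0
    if not gt_text:
        return 1.0 if not pred_text else 0.0
    ned = edit_distance(pred_text, gt_text) / max(len(pred_text), len(gt_text))
    return 1.0 - ned
```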

For VQA and translation, the system uses an LLM as a judge. VQA uses a binary reward that checks for a semantic match. Translation uses a COMET-style scoring LLM with scores in [0, 5], normalized to [0, 1]. The training framework enforces length limits and strict formats, and assigns zero reward when outputs overflow or break the schema, which stabilizes optimization and encourages valid JSON or structured outputs.
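
The format gate described above could look like the sketch below: any output that overflows the length limit or fails to parse gets zero reward before the judge score is even considered. The length limit and JSON-only check are illustrative assumptions:

```python
import json

def gated_reward(output, judge_score, max_len=2048):
    """Pass the judge's [0, 5] score through as a [0, 1] reward, but only
    for outputs that respect the length limit and parse as valid JSON.
    Malformed or overlong outputs get exactly zero, which discourages the
    policy from drifting away from the required schema."""
    if len(output) > max_len:
        return 0.0
    try:
        json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    return max(0.0, min(1.0, judge_score / 5.0))
```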

Benchmark Results: a 1B Model Competing with Larger VLMs

On the internal text spotting benchmark of 900 images across nine categories, HunyuanOCR reaches an overall score of 70.92. It outperforms traditional pipeline methods like PaddleOCR and BaiduOCR as well as general VLMs such as Gemini 2.5 Pro, Qwen3 VL 2B, Qwen3 VL 235B, and Seed 1.6 Vision, despite using far fewer parameters.

On OmniDocBench, HunyuanOCR achieves 94.10 overall, with 94.73 on formulas and 91.81 on tables. On the Wild OmniDocBench variant, which prints and recaptures documents under folds and lighting changes, it scores 85.21 overall. On DocML, a multilingual parsing benchmark across 14 non-Chinese, non-English languages, it reaches 91.03, and the paper reports state-of-the-art results across all 14 languages.

For information extraction and VQA, HunyuanOCR reaches 92.29 accuracy on cards, 92.53 on receipts, and 92.87 on video subtitles. On OCRBench, it scores 860, higher than DeepSeek OCR at comparable scale and close to larger general VLMs like Qwen3 VL 2B Instruct and Gemini 2.5 Pro.

For text image translation, HunyuanOCR is evaluated on the DoTA benchmark and a DocML-based internal set. It achieves a strong COMET score on DoTA for English-to-Chinese document translation, and the model won first place in Track 2.2 (OCR-free Small Model) of the ICDAR 2025 DIMT competition.


Key Takeaways

  • Compact end-to-end OCR VLM: HunyuanOCR is a 1B parameter, OCR-focused vision language model that connects a 0.4B native resolution ViT to a 0.5B Hunyuan language model through an MLP adapter, and runs spotting, parsing, information extraction, VQA, and translation in a single end-to-end, instruction-driven pipeline without external layout or detection modules.
  • Unified support for diverse OCR scenarios: The model is trained on more than 200M image-text pairs across nine scenarios, including documents, street views, advertisements, handwritten content, screenshots, cards and invoices, game interfaces, and video frames, with coverage of over 130 languages in training and support for more than 100 languages in deployment.
  • Data pipeline plus reinforcement learning: Training uses a four-stage recipe, vision-language alignment, multimodal pre-training, long-context pre-training, and application-oriented supervised fine-tuning, followed by reinforcement learning with Group Relative Policy Optimization and verifiable rewards for spotting, parsing, VQA, and translation.
  • Strong benchmark results for sub-3B models: HunyuanOCR reaches 94.1 on OmniDocBench for document understanding and 860 on OCRBench, reported as state of the art among vision language models with fewer than 3B parameters, while also outperforming several commercial OCR APIs and larger open models such as Qwen3 VL 4B on core OCR benchmarks.

Editorial Notes

HunyuanOCR is a strong signal that OCR-specific VLMs are maturing into practical infrastructure, not just benchmarks. Tencent combines a 1B parameter end-to-end architecture, a native resolution vision transformer, an adaptive MLP connector, and RL with verifiable rewards to deliver a single model that covers spotting, parsing, information extraction, VQA, and translation across more than 100 languages, while reaching leading OCRBench scores among sub-3B models and 94.1 on OmniDocBench. Overall, HunyuanOCR marks an important shift toward compact, instruction-driven OCR engines that are realistic for production deployment.



The post Tencent Hunyuan Releases HunyuanOCR: a 1B Parameter End to End OCR Expert VLM appeared first on MarkTechPost.
