IBM Releases Granite 4.0 3B Vision: A New Vision-Language Model for Enterprise-Grade Document Data Extraction
IBM has announced the release of Granite 4.0 3B Vision, a vision-language model (VLM) engineered specifically for enterprise-grade document data extraction. Departing from the monolithic approach of larger multimodal models, the 4.0 Vision release is architected as a specialized adapter designed to bring high-fidelity visual reasoning to the Granite 4.0 Micro language backbone.
This release represents a shift toward modular, extraction-focused AI that prioritizes structured-data accuracy (such as converting complex charts to code or tables to HTML) over general-purpose image captioning.
Architecture: Modular LoRA and DeepStack Integration
The Granite 4.0 3B Vision model is delivered as a LoRA (Low-Rank Adaptation) adapter with roughly 0.5B parameters. The adapter is designed to be loaded on top of the Granite 4.0 Micro base model, a 3.5B-parameter dense language model. This design enables "dual-mode" deployment: the base model can handle text-only requests independently, while the vision adapter is activated only when multimodal processing is required.
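The dual-mode idea can be sketched as a simple request dispatcher: the LoRA vision adapter is attached only when a request carries an image. The names below are illustrative, not IBM's actual serving API:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Request:
    text: str
    image: Optional[bytes] = None  # raw image payload, if any

def route(request: Request) -> str:
    """Activate the vision adapter only for multimodal requests.

    Hypothetical routing logic: text-only traffic runs on the dense base
    model alone; image-bearing requests load the ~0.5B LoRA adapter on top.
    """
    if request.image is not None:
        return "granite-4.0-micro + vision LoRA adapter"
    return "granite-4.0-micro (text-only, adapter bypassed)"

print(route(Request("Summarize this contract.")))        # text-only path
print(route(Request("Extract the table.", image=b"...")))  # multimodal path
```

In practice, adapter frameworks such as Hugging Face PEFT let the same deployment enable or disable an adapter per request, which is what makes this single-deployment pattern attractive.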
Vision Encoder and Patch Tiling
The visual component uses the google/siglip2-so400m-patch16-384 encoder. To maintain high resolution across varied document layouts, the model employs a tiling mechanism: input images are decomposed into 384×384 patches, which are processed alongside a downscaled global view of the entire image. This approach ensures that fine details, such as subscripts in formulas or small data points in charts, are preserved before they reach the language backbone.
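The tiling bookkeeping is easy to illustrate. The sketch below only counts how many 384×384 tiles cover an image plus the single global view; a real pipeline would also resize or pad the image to tile boundaries:

```python
import math

TILE = 384  # SigLIP2 encoder input resolution

def tile_plan(width: int, height: int) -> dict:
    """Count the 384x384 tiles needed to cover an image, plus one global view.

    Illustrative arithmetic only; the exact resize/crop policy Granite uses
    is not reproduced here.
    """
    cols = math.ceil(width / TILE)
    rows = math.ceil(height / TILE)
    return {"tiles": cols * rows, "grid": (rows, cols), "global_views": 1}

# A 1700x2200 px scanned page (roughly US Letter at 200 dpi):
print(tile_plan(1700, 2200))  # {'tiles': 30, 'grid': (6, 5), 'global_views': 1}
```

The global view gives the model page-level layout context, while the high-resolution tiles preserve the fine print.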
The DeepStack Backbone
To bridge the vision and language modalities, IBM uses a variant of the DeepStack architecture. This involves stacking visual tokens deep into the language model across 8 specific injection points. By routing visual features into multiple layers of the transformer, the model achieves tighter alignment between the "what" (semantic content) and the "where" (spatial structure), which is critical for preserving structure during document parsing.
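One common way to place multiple injection points is to spread them evenly across the transformer's depth. IBM has not published the exact layer indices here, so the following is a hypothetical placement strategy for a hypothetical 32-layer backbone:

```python
def injection_layers(num_layers: int, num_injections: int = 8) -> list:
    """Evenly spaced layer indices for visual-token injection.

    Assumption: injections are spread uniformly over the depth. The real
    DeepStack placement in Granite may differ.
    """
    step = num_layers / num_injections
    return [round(i * step) for i in range(num_injections)]

# For a hypothetical 32-layer backbone:
print(injection_layers(32))  # [0, 4, 8, 12, 16, 20, 24, 28]
```

Spreading injections over depth is what lets early layers ground spatial detail while later layers refine semantics, rather than forcing all visual information through a single entry point.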
Training Curriculum: Focused on Chart and Table Extraction
The training of Granite 4.0 3B Vision reflects a strategic shift toward specialized extraction tasks. Rather than relying solely on general image-text datasets, IBM used a curated mixture of instruction-following data focused on complex document structures.
- ChartWeb Dataset: The model was refined using ChartWeb, a million-scale multimodal dataset designed for robust chart understanding.
- Code-Guided Pipeline: A key technical highlight of the training is a "code-guided" approach to chart reasoning. The pipeline uses aligned data consisting of the original plotting code, the resulting rendered image, and the underlying data table, allowing the model to learn the structural relationship between visual representations and their source data.
- Extraction Tuning: The model was fine-tuned on a mix of datasets focused on key-value pair (KVP) extraction, table structure recognition, and converting visual charts into machine-readable formats such as CSV, JSON, and OTSL.
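The code-guided idea can be sketched as assembling one aligned training record: the plotting code (kept as a string, never executed), a path to the rendered image, and the underlying table serialized into the machine-readable target formats. Field names and structure below are assumptions for illustration, not IBM's actual data schema:

```python
import csv
import io
import json

# Toy chart data: what the model should recover from the rendered image.
table = [{"quarter": "Q1", "revenue": 120}, {"quarter": "Q2", "revenue": 150}]

# Plotting code that would render the chart (stored as text only).
plot_code = (
    "import matplotlib.pyplot as plt\n"
    "plt.bar([r['quarter'] for r in data], [r['revenue'] for r in data])\n"
)

def aligned_record(code: str, rows: list, image_path: str) -> dict:
    """Build one hypothetical code-guided example: code + image + data table."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return {
        "plot_code": code,                # source of the visual structure
        "image": image_path,              # path to the rendered chart
        "table_csv": buf.getvalue(),      # CSV extraction target
        "table_json": json.dumps(rows),   # JSON extraction target
    }

rec = aligned_record(plot_code, table, "charts/q_revenue.png")
print(rec["table_csv"].splitlines()[0])  # quarter,revenue
```

Because the code, image, and table all describe the same underlying data, the model can be supervised to map from pixels back to structure rather than to free-form captions.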
Performance and Evaluation Benchmarks
In technical evaluations, Granite 4.0 3B Vision has been benchmarked against several industry-standard suites for document understanding. Notably, datasets such as PubTables-v2 and OmniDocBench are used as evaluation benchmarks to verify the model's zero-shot performance in real-world scenarios.
| Task | Evaluation Benchmark | Metric |
| --- | --- | --- |
| KVP Extraction | VAREX | 85.5% exact match (zero-shot) |
| Chart Reasoning | ChartWeb (human-verified test set) | High accuracy on Chart2Summary |
| Table Extraction | TableVQA-Bench & OmniDocBench | Evaluated via TEDS and HTML extraction |
The model currently ranks third among models in the 2–4B parameter class on the VAREX leaderboard (as of March 2026), demonstrating its efficiency in structured extraction despite its compact size.
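Exact match, the metric reported for KVP extraction, is conceptually simple: a prediction scores only if every key-value pair matches the reference. One straightforward reading is sketched below; the official VAREX scorer may normalize values or score per field instead:

```python
def exact_match(predictions: list, references: list) -> float:
    """Fraction of documents whose predicted key-value pairs match exactly.

    Illustrative scoring only; real benchmark harnesses often apply
    whitespace/case normalization before comparing.
    """
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(references)

preds = [{"invoice_no": "A-17", "total": "42.00"},
         {"invoice_no": "B-09", "total": "13.50"}]
refs  = [{"invoice_no": "A-17", "total": "42.00"},
         {"invoice_no": "B-09", "total": "13.00"}]  # second total differs
print(exact_match(preds, refs))  # 0.5
```

The all-or-nothing nature of exact match is what makes 85.5% zero-shot a strong result for a 3B-class model: a single wrong field zeroes out the whole document.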


Key Takeaways
- Modular LoRA Architecture: The model is a 0.5B-parameter LoRA adapter that operates on the Granite 4.0 Micro (3.5B) backbone. This design allows a single deployment to handle text-only workloads efficiently while activating vision capabilities only when needed.
- High-Resolution Tiling: Using the google/siglip2-so400m-patch16-384 encoder, the model processes images by tiling them into 384×384 patches alongside a downscaled global view, ensuring that fine details in complex documents are preserved.
- DeepStack Injection: To improve layout awareness, the model uses a DeepStack approach with 8 injection points, routing semantic features to earlier layers and spatial details to later layers, which is critical for accurate table and chart extraction.
- Specialized Extraction Training: Beyond general instruction following, the model was refined using ChartWeb and a "code-guided" pipeline that aligns plotting code, images, and data tables to help the model internalize the logic of visual data structures.
- Developer-Ready Integration: The release is Apache 2.0 licensed and features native support for vLLM (via a custom model implementation) and Docling, IBM's tool for converting unstructured PDFs into machine-readable JSON or HTML.
Check out the technical details and model weights.
The post IBM Releases Granite 4.0 3B Vision: A New Vision-Language Model for Enterprise-Grade Document Data Extraction appeared first on MarkTechPost.
