IBM AI Releases Granite-Docling-258M: An Open-Source, Enterprise-Ready Document AI Model
IBM has launched Granite-Docling-258M, an open-source (Apache-2.0) vision-language model designed specifically for end-to-end document conversion. The model targets layout-faithful extraction (tables, code, equations, lists, captions, and reading order), emitting a structured, machine-readable representation rather than lossy Markdown. It is available on Hugging Face with a live demo and an MLX build for Apple Silicon.
What’s new compared to SmolDocling?
Granite-Docling is the product-ready successor to SmolDocling-256M. IBM replaced the earlier backbone with a Granite 165M language model and upgraded the vision encoder to SigLIP2 (base, patch16-512) while retaining the Idefics3-style connector (a pixel-shuffle projector). The resulting model has 258M parameters and shows consistent accuracy gains across layout analysis, full-page OCR, code, equations, and tables (see metrics below). IBM also addressed instability failure modes observed in the preview model (e.g., repetitive token loops).
Architecture and training pipeline
- Backbone: Idefics3-derived stack with a SigLIP2 vision encoder → pixel-shuffle connector → Granite 165M LLM.
- Training framework: nanoVLM (a lightweight, pure-PyTorch VLM training toolkit).
- Representation: Outputs DocTags, an IBM-authored markup designed for unambiguous document structure (elements + coordinates + relationships), which downstream tools convert to Markdown/HTML/JSON.
- Compute: Trained on IBM’s Blue Vela H100 cluster.
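The pixel-shuffle connector in the stack above shortens the visual token sequence before it reaches the LLM by folding r×r blocks of spatial tokens into the channel dimension. A minimal NumPy sketch of that rearrangement (illustrative only; the grid size, feature width, and shuffle factor are made-up numbers, not the model's actual configuration):

```python
import numpy as np

def pixel_shuffle(tokens: np.ndarray, r: int) -> np.ndarray:
    """Fold r x r spatial blocks into channels: (H, W, C) -> (H//r, W//r, C*r*r).

    This cuts the token count by r**2 while keeping every feature value,
    which is how Idefics3-style connectors compress the visual sequence.
    """
    H, W, C = tokens.shape
    assert H % r == 0 and W % r == 0
    x = tokens.reshape(H // r, r, W // r, r, C)  # split each spatial axis into blocks
    x = x.transpose(0, 2, 1, 3, 4)               # group the r x r block dims together
    return x.reshape(H // r, W // r, C * r * r)  # merge each block into the channel dim

# Toy example: a 32x32 grid of 64-dim patch features, shuffle factor 2.
feats = np.random.rand(32, 32, 64)
out = pixel_shuffle(feats, r=2)
print(out.shape)  # (16, 16, 256): 4x fewer tokens, 4x wider features
```

The sequence handed to the language model shrinks from 1024 tokens to 256 in this toy setup, with no information discarded.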
Quantified improvements (Granite-Docling-258M vs. SmolDocling-256M preview)
Evaluated with docling-eval, LMMS-Eval, and task-specific datasets:
- Layout: mAP 0.27 vs. 0.23; F1 0.86 vs. 0.85.
- Full-page OCR: F1 0.84 vs. 0.80; lower edit distance.
- Code recognition: F1 0.988 vs. 0.915; edit distance 0.013 vs. 0.114.
- Equation recognition: F1 0.968 vs. 0.947.
- Table recognition (FinTabNet @150dpi): TEDS-structure 0.97 vs. 0.82; TEDS with content 0.96 vs. 0.76.
- Other benchmarks: MMStar 0.30 vs. 0.17; OCRBench 500 vs. 338.
- Stability: avoids infinite loops more effectively (a production-oriented fix).
Multilingual support
Granite-Docling adds experimental support for Japanese, Arabic, and Chinese. IBM marks this as early-stage; English remains the primary target.
How the DocTags pathway changes Document AI
Conventional OCR-to-Markdown pipelines lose structural information and complicate downstream retrieval-augmented generation (RAG). Granite-Docling emits DocTags, a compact, LLM-friendly structural grammar, which Docling converts into Markdown/HTML/JSON. This preserves table topology, inline and floating math, code blocks, captions, and reading order with explicit coordinates, improving index quality and grounding for RAG and analytics.
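To see why explicit coordinates matter downstream, consider recovering reading order from located page elements, something a flat Markdown dump cannot support. A toy sketch (the records and field names below are invented for illustration; they are NOT the DocTags schema):

```python
# Invented element records with (x0, y0, x1, y1) bounding boxes in page pixels.
elements = [
    {"type": "caption", "bbox": (60, 540, 300, 560)},
    {"type": "title",   "bbox": (50, 40, 550, 80)},
    {"type": "table",   "bbox": (60, 300, 540, 520)},
    {"type": "text",    "bbox": (50, 100, 550, 280)},
]

def reading_order(elems, row_tol=20):
    # Bucket elements into coarse rows by top edge, then order
    # left-to-right within each row (top-to-bottom, left-to-right overall).
    return sorted(elems, key=lambda e: (e["bbox"][1] // row_tol, e["bbox"][0]))

order = [e["type"] for e in reading_order(elements)]
print(order)  # ['title', 'text', 'table', 'caption']
```

A real pipeline handles multi-column layouts and floating figures with more care; the point is that once coordinates survive extraction, ordering and grounding become simple sorts and lookups rather than guesswork.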
Inference and integration
- Docling integration (recommended): The docling CLI/SDK automatically pulls Granite-Docling and converts PDFs, office documents, and images to multiple formats. IBM positions the model as a component within Docling pipelines rather than a general-purpose VLM.
- Runtimes: Works with Transformers, vLLM, ONNX, and MLX; a dedicated MLX build is optimized for Apple Silicon. A Hugging Face Space provides an interactive demo (ZeroGPU).
- License: Apache-2.0.
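The Docling-based path above can be sketched from the command line. A hedged example (assumes docling is installed via pip; flag names vary across docling releases, so treat these invocations as a sketch and confirm against `docling --help`):

```shell
# Install the Docling toolkit (Python 3 environment assumed).
pip install docling

# Convert a PDF to Markdown with the default pipeline.
docling report.pdf --to md

# Newer docling releases expose a VLM pipeline with a Granite-Docling preset;
# verify the exact option names for your installed version.
docling report.pdf --pipeline vlm --vlm-model granite_docling --to md
```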
Why Granite-Docling?
For enterprise document AI, small VLMs that preserve structure reduce inference cost and pipeline complexity. Granite-Docling replaces several single-purpose models (layout, OCR, tables, code, equations) with a single component that emits a richer intermediate representation, improving downstream retrieval and conversion fidelity. The measured gains (TEDS for tables, F1 for code and equations, and reduced instability) make it a practical upgrade from SmolDocling for production workflows.
Summary
Granite-Docling-258M marks a significant advance in compact, structure-preserving document AI. By combining IBM’s Granite backbone, the SigLIP2 vision encoder, and the nanoVLM training framework, it delivers enterprise-ready performance across tables, equations, code, and multilingual text, all while remaining lightweight and open-source under Apache 2.0. With measurable gains over its SmolDocling predecessor and seamless integration into Docling pipelines, Granite-Docling provides a practical foundation for document-conversion and RAG workflows where precision and reliability are critical.
Check out the model on Hugging Face and the live demo. Tutorials, code, and notebooks are available on our GitHub page.
The post IBM AI Releases Granite-Docling-258M: An Open-Source, Enterprise-Ready Document AI Model appeared first on MarkTechPost.