IBM AI Releases Granite-Docling-258M: An Open-Source, Enterprise-Ready Document AI Model
IBM has launched Granite-Docling-258M, an open-source (Apache-2.0) vision-language model designed specifically for end-to-end document conversion. The model targets layout-faithful extraction (tables, code, equations, lists, captions, and reading order), emitting a structured, machine-readable representation rather than lossy Markdown. It is available on Hugging Face with a live demo and an MLX build for Apple Silicon.
What’s new compared to SmolDocling?
Granite-Docling is the product-ready successor to SmolDocling-256M. IBM replaced the earlier backbone with a Granite 165M language model and upgraded the vision encoder to SigLIP2 (base, patch16-512) while retaining the Idefics3-style connector (a pixel-shuffle projector). The resulting model has 258M parameters and shows consistent accuracy gains across layout analysis, full-page OCR, code, equations, and tables (see metrics below). IBM also addressed instability failure modes observed in the preview model (e.g., repetitive token loops).
Architecture and training pipeline
- Backbone: Idefics3-derived stack with a SigLIP2 vision encoder → pixel-shuffle connector → Granite 165M LLM.
- Training framework: nanoVLM (a lightweight, pure-PyTorch VLM training toolkit).
- Representation: Outputs DocTags, an IBM-authored markup designed for unambiguous document structure (elements + coordinates + relationships), which downstream tools convert to Markdown/HTML/JSON.
- Compute: Trained on IBM’s Blue Vela H100 cluster.
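The pixel-shuffle connector in the stack above shortens the visual token sequence before it reaches the LLM by folding r×r blocks of spatial tokens into the channel dimension. A minimal NumPy sketch of that rearrangement (illustrative only; the grid size, feature width, and shuffle factor are made-up numbers, not the model's actual configuration):

```python
import numpy as np

def pixel_shuffle(tokens: np.ndarray, r: int) -> np.ndarray:
    """Fold r x r spatial blocks into channels: (H, W, C) -> (H//r, W//r, C*r*r).

    This cuts the token count by r**2 while keeping every feature value,
    which is how Idefics3-style connectors compress the visual sequence.
    """
    H, W, C = tokens.shape
    assert H % r == 0 and W % r == 0
    x = tokens.reshape(H // r, r, W // r, r, C)  # split each spatial axis into blocks
    x = x.transpose(0, 2, 1, 3, 4)               # group the r x r block dims together
    return x.reshape(H // r, W // r, C * r * r)  # merge each block into the channel dim

# Toy example: a 32x32 grid of 64-dim patch features, shuffle factor 2.
feats = np.random.rand(32, 32, 64)
out = pixel_shuffle(feats, r=2)
print(out.shape)  # (16, 16, 256): 4x fewer tokens, 4x wider features
```

The sequence handed to the language model shrinks from 1024 tokens to 256 in this toy setup, with no information discarded.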
Quantified improvements (Granite-Docling-258M vs. SmolDocling-256M preview)
Evaluated with docling-eval, LMMS-Eval, and task-specific datasets:
- Layout: mAP 0.27 vs. 0.23; F1 0.86 vs. 0.85.
- Full-page OCR: F1 0.84 vs. 0.80; lower edit distance.
- Code recognition: F1 0.988 vs. 0.915; edit distance 0.013 vs. 0.114.
- Equation recognition: F1 0.968 vs. 0.947.
- Table recognition (FinTabNet @150dpi): TEDS-structure 0.97 vs. 0.82; TEDS with content 0.96 vs. 0.76.
- Other benchmarks: MMStar 0.30 vs. 0.17; OCRBench 500 vs. 338.
- Stability: avoids infinite loops more effectively (a production-oriented fix).
Multilingual support
Granite-Docling adds experimental support for Japanese, Arabic, and Chinese. IBM marks this as early-stage; English remains the primary target.
How the DocTags pathway changes Document AI
Conventional OCR-to-Markdown pipelines lose structural information and complicate downstream retrieval-augmented generation (RAG). Granite-Docling emits DocTags, a compact, LLM-friendly structural grammar, which Docling converts into Markdown/HTML/JSON. This preserves table topology, inline and floating math, code blocks, captions, and reading order with explicit coordinates, improving index quality and grounding for RAG and analytics.
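To see why explicit coordinates matter downstream, consider recovering reading order from located page elements, something a flat Markdown dump cannot support. A toy sketch (the records and field names below are invented for illustration; they are NOT the DocTags schema):

```python
# Invented element records with (x0, y0, x1, y1) bounding boxes in page pixels.
elements = [
    {"type": "caption", "bbox": (60, 540, 300, 560)},
    {"type": "title",   "bbox": (50, 40, 550, 80)},
    {"type": "table",   "bbox": (60, 300, 540, 520)},
    {"type": "text",    "bbox": (50, 100, 550, 280)},
]

def reading_order(elems, row_tol=20):
    # Bucket elements into coarse rows by top edge, then order
    # left-to-right within each row (top-to-bottom, left-to-right overall).
    return sorted(elems, key=lambda e: (e["bbox"][1] // row_tol, e["bbox"][0]))

order = [e["type"] for e in reading_order(elements)]
print(order)  # ['title', 'text', 'table', 'caption']
```

A real pipeline handles multi-column layouts and floating figures with more care; the point is that once coordinates survive extraction, ordering and grounding become simple sorts and lookups rather than guesswork.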
Inference and integration
- Docling integration (recommended): The docling CLI/SDK automatically pulls Granite-Docling and converts PDFs, office documents, and images to multiple formats. IBM positions the model as a component within Docling pipelines rather than a general-purpose VLM.
- Runtimes: Works with Transformers, vLLM, ONNX, and MLX; a dedicated MLX build is optimized for Apple Silicon. A Hugging Face Space provides an interactive demo (ZeroGPU).
- License: Apache-2.0.
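The Docling-based path above can be sketched from the command line. A hedged example (assumes docling is installed via pip; flag names vary across docling releases, so treat these invocations as a sketch and confirm against `docling --help`):

```shell
# Install the Docling toolkit (Python 3 environment assumed).
pip install docling

# Convert a PDF to Markdown with the default pipeline.
docling report.pdf --to md

# Newer docling releases expose a VLM pipeline with a Granite-Docling preset;
# verify the exact option names for your installed version.
docling report.pdf --pipeline vlm --vlm-model granite_docling --to md
```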
Why Granite-Docling?
For enterprise document AI, small VLMs that preserve structure reduce inference cost and pipeline complexity. Granite-Docling replaces several single-purpose models (layout, OCR, tables, code, equations) with a single component that emits a richer intermediate representation, improving downstream retrieval and conversion fidelity. The measured gains (TEDS for tables, F1 for code and equations, and reduced instability) make it a practical upgrade from SmolDocling for production workflows.
Summary
Granite-Docling-258M marks a significant advance in compact, structure-preserving document AI. By combining IBM’s Granite backbone, the SigLIP2 vision encoder, and the nanoVLM training framework, it delivers enterprise-ready performance across tables, equations, code, and multilingual text, all while remaining lightweight and open-source under Apache 2.0. With measurable gains over its SmolDocling predecessor and seamless integration into Docling pipelines, Granite-Docling provides a practical foundation for document-conversion and RAG workflows where precision and reliability are critical.
Check out the model on Hugging Face and the live demo. Tutorials, code, and notebooks are available on our GitHub page.
The post IBM AI Releases Granite-Docling-258M: An Open-Source, Enterprise-Ready Document AI Model appeared first on MarkTechPost.