Structured PDF-to-JSON: A Guide to Open-Source Extraction Models in 2026

Most enterprise data still sits inside PDFs, scans, and slide decks. Large language models and agents cannot use that data until it becomes structured JSON. Open-source document extraction has become the standard way to do this conversion on your own hardware.

Two different problems hide under the phrase ‘PDF to JSON.’ The first is schema-driven extraction: you define fields, and a model fills them with values. The second is document parsing: a model reconstructs the page into structured JSON or Markdown. Most teams need one, sometimes both. Choosing the wrong category costs real time.

Open weights matter here for cost and privacy. Proprietary APIs can cost thousands of dollars per million pages, and they require sending documents off-premise. Local models remove both constraints. Below are the models and toolkits worth evaluating, grouped by what they actually do.

Two categories, one phrase

Schema-driven extraction takes a document and a JSON schema, then returns values for your fields. Use it for invoices, forms, contracts, and receipts, where you know the fields in advance.
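To make the idea concrete, here is a minimal sketch of a schema and a shape check on a model's output. The invoice fields and the check_against_schema helper are hypothetical, and the check covers only required keys and primitive types. It is not a full JSON Schema validator.

```python
import json

# Hypothetical invoice schema: the field names are illustrative only.
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total": {"type": "number"},
        "currency": {"type": "string"},
    },
    "required": ["invoice_number", "total"],
}

# Map JSON Schema primitive names to Python types.
_PY_TYPES = {"string": str, "number": (int, float), "boolean": bool}


def check_against_schema(data, schema):
    """Return a list of problems; an empty list means the shape matches."""
    problems = []
    for key in schema.get("required", []):
        if key not in data:
            problems.append(f"missing required field: {key}")
    for key, spec in schema.get("properties", {}).items():
        if key not in data or data[key] is None:
            continue
        value = data[key]
        expected = _PY_TYPES.get(spec.get("type"))
        # bool is a subclass of int in Python, so reject it for "number".
        is_bool_as_number = spec.get("type") == "number" and isinstance(value, bool)
        if expected and (not isinstance(value, expected) or is_bool_as_number):
            problems.append(f"wrong type for field: {key}")
    return problems


# Pretend this JSON string came back from an extraction model.
model_output = json.loads('{"invoice_number": "INV-001", "total": 1250.5}')
print(check_against_schema(model_output, INVOICE_SCHEMA))
```

A check like this is useful even when a model guarantees valid JSON, because valid JSON can still miss a required field or carry a value of the wrong type.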

Document parsing reconstructs the document itself. It detects layout, reading order, tables, formulas, and code, then exports JSON or Markdown. Use it to prepare clean corpora for retrieval-augmented generation (RAG) and agents.
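As a rough illustration of what a parser's JSON output enables, the sketch below renders a list of layout blocks to Markdown in reading order. The block structure (type, order, text, rows) is invented for this example and does not match the output format of any specific tool covered here.

```python
# Hypothetical parsed page: block types and keys are illustrative and do not
# follow any specific tool's output format.
parsed_page = [
    {"type": "heading", "level": 2, "text": "Results", "order": 0},
    {"type": "table", "rows": [["Model", "Score"], ["A", "0.9"]], "order": 2},
    {"type": "paragraph", "text": "Scores are shown below.", "order": 1},
]


def table_to_markdown(rows):
    """Render rows (first row is the header) as a Markdown pipe table."""
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


def blocks_to_markdown(blocks):
    """Sort blocks by reading order, then render each one by type."""
    parts = []
    for block in sorted(blocks, key=lambda b: b["order"]):
        if block["type"] == "heading":
            parts.append("#" * block["level"] + " " + block["text"])
        elif block["type"] == "table":
            parts.append(table_to_markdown(block["rows"]))
        else:
            parts.append(block["text"])
    return "\n\n".join(parts)


markdown = blocks_to_markdown(parsed_page)
print(markdown)
```

The point of the sketch is that reading order and block types live in the JSON, so the same parsed document can feed a Markdown export, a chunker for RAG, or a table-only pipeline.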

Category 1: Schema-driven structured extraction

Datalab lift

lift is a 9B vision model from Datalab, the team behind Marker and Surya. You pass a JSON schema, and lift returns JSON that matches it. Schema-constrained decoding ensures the output is valid JSON. The model is built on Qwen 3.5 and runs locally through Hugging Face or remotely through a vLLM server.

It handles multi-page documents in a single pass, including values that span pages. It ships a CLI, a Python API, and a Streamlit ‘Schema Studio’ for building and testing schemas.

pip install lift-pdf

# Start the vLLM server, then extract to your schema
lift_vllm
lift_extract input.pdf ./output --schema schema.json

from lift import extract

result = extract("doc.pdf", "schema.json")
if result.extraction is not None:
    data = result.extraction  # dict matching your schema

On Datalab’s 225-document benchmark, lift reaches 90.2% field accuracy at 9.5s median latency. It leads NuExtract3 (81.5%) and Qwen3.5-9B (76.3%) on field accuracy. It trails Gemini Flash 3.5 (91.3%) and the hosted Datalab API (95.9%). Note that full-document accuracy stays low for all local models, with lift at 20.9%. Getting every field right in one document remains hard.
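The gap between the two metrics is easier to see in code. The sketch below computes both from toy data, assuming exact-match scoring per field. That scoring rule and the sample values are assumptions for illustration; Datalab's benchmark may score fields differently.

```python
def field_and_document_accuracy(predictions, references):
    """Compute per-field accuracy and all-fields-correct document accuracy."""
    correct_fields = 0
    total_fields = 0
    perfect_docs = 0
    for pred, ref in zip(predictions, references):
        # Count reference fields whose predicted value matches exactly.
        hits = sum(1 for key, value in ref.items() if pred.get(key) == value)
        correct_fields += hits
        total_fields += len(ref)
        if hits == len(ref):
            perfect_docs += 1
    return correct_fields / total_fields, perfect_docs / len(references)


# Toy data: two documents with three fields each (values are made up).
references = [
    {"vendor": "Acme", "total": 10.0, "date": "2026-01-05"},
    {"vendor": "Globex", "total": 99.5, "date": "2026-02-11"},
]
predictions = [
    {"vendor": "Acme", "total": 10.0, "date": "2026-01-05"},
    {"vendor": "Globex", "total": 99.5, "date": "2026-02-01"},
]

field_acc, doc_acc = field_and_document_accuracy(predictions, references)
print(f"field accuracy: {field_acc:.3f}, document accuracy: {doc_acc:.3f}")
```

One wrong date drops document accuracy to 50% while field accuracy stays above 83%. As a back-of-the-envelope guide, if field errors were independent, a document with n fields would be fully correct with probability p^n, so 90% field accuracy on a 15-field document gives about 0.9^15 ≈ 21%. Real errors are correlated, so treat that only as intuition for why the two numbers diverge.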

The code is Apache-2.0. The weights use a modified OpenRAIL-M license, free for research, personal use, and startups under $5M in funding or revenue. Commercial self-hosting needs a license, and the weights cannot be used to compete with the Datalab API.

NuMind NuExtract 3

NuExtract 3 is a 4B vision-language model from NuMind. It unifies two tasks in one model: structured extraction (document to JSON) and content extraction (OCR to Markdown). You provide an input and a JSON template describing the fields you want. The model is trained with reinforcement learning to add extraction-specific reasoning, which you can switch on or off per request.

NuExtract 3 is multimodal, multilingual, and based on a Qwen backbone. It serves through vLLM with an OpenAI-compatible API, and a Python SDK is available via pip install numind. NuMind positions it as a reference open model for both structured and content extraction at its size. Check the model card for exact license terms before commercial use.

Category 2: Document parsing to structured JSON and Markdown

IBM Docling

Docling started at IBM Research and is now hosted by the LF AI & Data Foundation. It parses PDF, DOCX, PPTX, XLSX, HTML, images, and more. Output formats include Markdown, HTML, lossless JSON, and DocTags. Its core is the DoclingDocument representation, which preserves layout, reading order, and tables, and stores formulas as LaTeX.

Docling runs locally for air-gapped environments. It integrates with LangChain, LlamaIndex, Crew AI, and Haystack, and ships an MCP server and a Docling Serve mode. The project carries a permissive MIT license. IBM also offers a managed version through watsonx.

IBM Granite-Docling-258M

Granite-Docling-258M is a compact 258M vision-language model from IBM. It performs one-shot document conversion inside Docling pipelines. Despite its size, it handles OCR, layout, tables, code, and equations, and outputs DocTags. On an A100 GPU, it averages roughly 0.35 seconds per page.
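For capacity planning, the per-page figure converts to GPU-hours with simple arithmetic. The sketch below assumes a single GPU processing pages one after another at the quoted average, and ignores batching, I/O, and retries, so real throughput will differ.

```python
def gpu_hours_for_pages(num_pages, seconds_per_page):
    """Sequential single-GPU estimate; ignores batching, I/O, and retries."""
    return num_pages * seconds_per_page / 3600


# 0.35 s/page is the A100 average quoted above; the corpus size is an example.
hours = gpu_hours_for_pages(1_000_000, 0.35)
pages_per_hour = 3600 / 0.35

print(f"{hours:.1f} GPU-hours per million pages")
print(f"{pages_per_hour:.0f} pages per GPU-hour")
```

Under these assumptions, a million pages takes roughly 97 GPU-hours, or a little over four days on one A100.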

The model builds on the Idefics3 architecture, with a SigLIP2 encoder and a Granite 165M language backbone. It is released under Apache 2.0. IBM states it is built for document conversion, not general image understanding.

OpenDataLab MinerU

MinerU, from OpenDataLab and Shanghai AI Laboratory, converts PDF, image, DOCX, PPTX, and XLSX inputs into Markdown and JSON. It pairs a processing pipeline with a vision-language model. The current model, MinerU2.5-Pro, targets high-resolution parsing of complex layouts, including cross-page tables and charts.

MinerU recently changed its license. It moved from AGPL-3.0 to the “MinerU Open Source License,” a custom license based on Apache 2.0 with additional conditions. That change lowers friction for commercial deployment.

Datalab Marker

Marker is Datalab’s pipeline for converting documents into Markdown, JSON, chunks, and HTML. It supports PDF, image, PPTX, DOCX, XLSX, HTML, and EPUB. It formats tables, forms, equations, inline math, links, and code. An optional --use_llm flag adds a language model to improve tables and forms.

On the third-party olmOCR-Bench suite, Marker scores around 76.1. Its code is GPL-3.0, and its model weights use a modified AI Pubs OpenRAIL-M license. That weight license is free for research, personal use, and startups under $2M in funding or revenue. Datalab’s managed platform now runs a newer OCR model, Chandra, which is Apache-2.0 and outputs HTML, Markdown, and JSON.

Ai2 olmOCR 2

olmOCR 2 is a 7B OCR-specialized vision-language model from the Allen Institute for AI (Ai2). It converts PDFs into clean text and Markdown while preserving reading order. It handles tables, equations, and handwriting across complex multi-column layouts. The model is trained with reinforcement learning from verifiable rewards, using synthetic unit tests as the reward signal.

olmOCR 2 scores 82.4 on its own olmOCR-Bench, among the higher published results on that suite. Ai2 estimates a cost of roughly $178 per million pages on your own GPUs. The toolkit and the allenai/olmOCR-2-7B-1025 weights are Apache-2.0. The current model is English-focused.

DeepSeek DeepSeek-OCR

DeepSeek-OCR is an open OCR model from DeepSeek, released in October 2025. It introduces “contexts optical compression,” which represents text-rich pages as compact vision tokens, then decodes them back to text. This lets it process long documents with far fewer tokens than typical vision-language models.

It uses a DeepEncoder plus a 3B Mixture-of-Experts decoder that activates about 570M parameters per token. Depending on the prompt, it outputs plain text, Markdown, HTML tables, or structured JSON, and it supports 100+ languages. The code is released under the MIT license. A follow-up, DeepSeek-OCR2, arrived in January 2026.

The general-purpose option: Qwen3-VL

Qwen3-VL from Alibaba is not a document-specific model. It is a general multimodal series that many extraction models use as a base. You can prompt it to return Markdown, JSON, or code from a page. Most sizes ship under Apache 2.0. It is a flexible fallback when a specialized model doesn’t fit, though it needs more prompt engineering and offers fewer output guarantees.
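“Fewer output guarantees” in practice often means the reply wraps JSON in a Markdown code fence or adds commentary around it. The sketch below is one defensive way to handle that, using only the standard library. The parse_json_reply helper is hypothetical and not part of any Qwen tooling, and it is a best-effort parser, not a substitute for constrained decoding.

```python
import json
import re

# Build the fence marker programmatically to keep this snippet easy to embed.
FENCE = "`" * 3
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def parse_json_reply(reply):
    """Best-effort parse of a JSON object from a free-form model reply."""
    # Prefer the contents of a Markdown code fence if the model used one.
    fenced = FENCED_BLOCK.search(reply)
    candidate = fenced.group(1) if fenced else reply
    # Fall back to the outermost braces to skip any surrounding commentary.
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(candidate[start:end + 1])
    except json.JSONDecodeError:
        return None


reply = "Here you go:\n" + FENCE + 'json\n{"total": 12.5}\n' + FENCE
print(parse_json_reply(reply))
```

Returning None on failure lets the caller retry the prompt or route the page to a specialized model instead of crashing on malformed output.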

How the options compare

| Model | Org | Size | What it does | Primary output | License |
| --- | --- | --- | --- | --- | --- |
| lift | Datalab | 9B | Schema-driven extraction | JSON to your schema | Apache-2.0 code / OpenRAIL-M weights |
| NuExtract 3 | NuMind | 4B | Schema extraction + OCR | JSON + Markdown | Open weights (see card) |
| Docling | IBM / LF AI & Data | Pipeline | Layout parsing | Markdown, JSON, DocTags | MIT |
| Granite-Docling | IBM | 258M | One-shot conversion | DocTags, Markdown | Apache-2.0 |
| MinerU | OpenDataLab | ~1.2B VLM | Layout parsing | Markdown, JSON | MinerU Open Source License |
| Marker | Datalab | Pipeline | Layout parsing | Markdown, JSON, HTML | GPL-3.0 code / OpenRAIL-M weights |
| olmOCR 2 | Ai2 | 7B | OCR to text | Plain text, Markdown | Apache-2.0 |
| DeepSeek-OCR | DeepSeek | 3B MoE (~570M active) | OCR with token compression | Text, Markdown, JSON | MIT (code) |
| Qwen3-VL | Alibaba | 2B–235B | General VLM | Markdown, JSON, code | Apache-2.0 (most sizes) |

A note on benchmarks: these numbers come from different suites and are not directly comparable. lift’s 90.2% is field accuracy on Datalab’s schema-extraction benchmark. The olmOCR-Bench scores for olmOCR 2 (82.4) and Marker (76.1) measure content extraction with unit-test scoring. Run your own documents through each candidate before deciding.


Marktechpost · AI Media Inc.
Verified from primary sources · July 2026
