
How to Build a Parsing Pipeline with Docling Parse for Layout-Aware Document Intelligence

In this tutorial, we build a workflow that uses Docling Parse to analyze PDF documents at a detailed structural level. We begin by preparing a stable Python environment, handling common Colab dependency issues, and generating a custom multi-page PDF with text, columns, table-like content, vector shapes, and an embedded image. We then use Docling Parse to extract words, characters, and lines with page-level coordinates, render visual overlays, and save the results into structured JSON and CSV files. Through this workflow, we see how low-level PDF parsing can support document AI tasks such as layout analysis, reading-order reconstruction, table-aware processing, and retrieval-ready document preparation.

Setting Up the Docling Parse Colab Environment and Dependencies

import os, sys, subprocess, textwrap, json, time, shutil
from pathlib import Path
def run(cmd):
   print(f"\n$ {cmd}")
   return subprocess.run(cmd, shell=True, text=True, capture_output=False)
# Install dependencies; Pillow is pinned to a range to avoid Colab's mixed PIL installs.
run(f'{sys.executable} -m pip install -q --no-cache-dir -U "pillow>=10.4.0,<12" reportlab pandas matplotlib docling-core docling-parse')
try:
   from PIL import Image, ImageDraw
except ImportError:
   print("\nPillow import failed because Colab has a mixed PIL install.")
   print("Reinstalling Pillow and restarting runtime. After restart, run this same cell again.")
   run(f'{sys.executable} -m pip uninstall -y pillow PIL')
   run(f'{sys.executable} -m pip install -q --no-cache-dir --force-reinstall "pillow>=10.4.0,<12"')
   # Kill the kernel so Colab restarts with the freshly installed Pillow.
   os.kill(os.getpid(), 9)
import pandas as pd
import matplotlib.pyplot as plt
from reportlab.lib.pagesizes import A4
from reportlab.lib import colors
from reportlab.platypus import Table, TableStyle
from reportlab.pdfgen import canvas
from docling_core.types.doc.page import TextCellUnit
from docling_parse.pdf_parser import DoclingPdfParser
print("Environment ready.")
print("Python:", sys.version.split()[0])
WORKDIR = Path("/content/docling_parse_advanced_tutorial")
WORKDIR.mkdir(parents=True, exist_ok=True)
PDF_PATH = WORKDIR / "advanced_docling_parse_demo.pdf"
OUT_DIR = WORKDIR / "outputs"
OUT_DIR.mkdir(exist_ok=True)
DEMO_IMAGE_PATH = WORKDIR / "demo_bitmap.png"

We set up the Colab environment by installing Docling Parse, Docling Core, Pillow, ReportLab, Pandas, and Matplotlib. We also handle the Pillow import issue safely so the notebook can recover if Colab has a broken or mixed PIL install. We then define the working directory, output folder, PDF path, and image path that we use throughout the tutorial.
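
Before moving on, it can help to confirm which package versions pip actually resolved, since a Pillow conflict usually shows up as an unexpected version. The following is a minimal sketch that uses only the standard library; the helper name `installed_versions` is ours and is not part of any of these packages.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


wanted = ["pillow", "reportlab", "pandas", "matplotlib", "docling-core", "docling-parse"]
for name, version in installed_versions(wanted).items():
    print(f"{name:15s} {version or 'NOT INSTALLED'}")
```

If any entry prints NOT INSTALLED, rerunning the install cell is usually the quickest fix.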

Generating a Multi-Element Test PDF for Parser Evaluation

def create_demo_image(path):
   img = Image.new("RGB", (320, 180), "white")
   draw = ImageDraw.Draw(img)
   draw.rectangle([20, 20, 300, 160], outline="black", width=3)
   draw.ellipse([55, 45, 145, 135], outline="black", width=4)
   draw.line([180, 140, 285, 45], fill="black", width=4)
   draw.text((45, 145), "Embedded bitmap image", fill="black")
   img.save(path)
create_demo_image(DEMO_IMAGE_PATH)
def build_pdf(pdf_path):
   c = canvas.Canvas(str(pdf_path), pagesize=A4)
   width, height = A4
   # Page 1: title and introductory paragraph.
   c.setFont("Helvetica-Bold", 20)
   c.drawString(60, height - 70, "Docling Parse Advanced PDF Parsing Tutorial")
   c.setFont("Helvetica", 11)
   intro = (
       "This generated document is designed for testing text extraction, coordinate parsing, "
       "line grouping, vector path detection, bitmap resources, and layout-aware reconstruction."
   )
   text_obj = c.beginText(60, height - 105)
   text_obj.setLeading(15)
   for line in textwrap.wrap(intro, width=90):
       text_obj.textLine(line)
   c.drawText(text_obj)
   # Two side-by-side text columns.
   c.setFont("Helvetica-Bold", 14)
   c.drawString(60, height - 170, "1. Two-column text area")
   left_para = (
       "The left column contains compact explanatory text. A parser should expose words, "
       "characters, and line-level cells along with coordinates. These coordinates allow us "
       "to reconstruct reading order and inspect the spatial structure of a page."
   )
   right_para = (
       "The right column contains a separate paragraph. In document AI pipelines, layout "
       "features are useful for retrieval, table extraction, chunking, and downstream RAG "
       "applications where page position can matter."
   )
   y_start = height - 200
   left_text = c.beginText(60, y_start)
   left_text.setFont("Helvetica", 10)
   left_text.setLeading(13)
   for line in textwrap.wrap(left_para, width=42):
       left_text.textLine(line)
   c.drawText(left_text)
   right_text = c.beginText(325, y_start)
   right_text.setFont("Helvetica", 10)
   right_text.setLeading(13)
   for line in textwrap.wrap(right_para, width=42):
       right_text.textLine(line)
   c.drawText(right_text)
   # Vector shapes: column frames, a circle, and a diagonal line.
   c.setStrokeColor(colors.darkblue)
   c.setLineWidth(2)
   c.rect(55, height - 315, 225, 130, stroke=1, fill=0)
   c.rect(320, height - 315, 225, 130, stroke=1, fill=0)
   c.setStrokeColor(colors.darkgreen)
   c.setLineWidth(3)
   c.circle(140, height - 390, 40, stroke=1, fill=0)
   c.line(220, height - 430, 310, height - 355)
   # Table-like block drawn with a ReportLab Table.
   c.setFont("Helvetica-Bold", 14)
   c.setFillColor(colors.black)
   c.drawString(60, height - 470, "2. Simple table-like structure")
   data = [
       ["Section", "Signal", "Expected parser behavior"],
       ["Text", "Words and lines", "Return text cells with coordinates"],
       ["Vector", "Boxes and lines", "Expose page path/vector resources"],
       ["Bitmap", "Embedded image", "Expose or render image resources"],
   ]
   table = Table(data, colWidths=[100, 130, 260])
   table.setStyle(TableStyle([
       ("BACKGROUND", (0, 0), (-1, 0), colors.lightgrey),
       ("GRID", (0, 0), (-1, -1), 0.7, colors.black),
       ("FONTNAME", (0, 0), (-1, 0), "Helvetica-Bold"),
       ("FONTSIZE", (0, 0), (-1, -1), 9),
       ("VALIGN", (0, 0), (-1, -1), "MIDDLE"),
   ]))
   table.wrapOn(c, width, height)
   table.drawOn(c, 60, height - 590)
   c.setFont("Helvetica", 9)
   c.drawString(60, 55, "Page 1: generated programmatic PDF with text, table-like layout, and vector paths.")
   c.showPage()
   # Page 2: dense text blocks followed by the embedded bitmap.
   c.setFont("Helvetica-Bold", 18)
   c.drawString(60, height - 70, "Page 2: Bitmap, Dense Text, and Reading Order")
   c.setFont("Helvetica", 10)
   dense = (
       "This page includes an embedded bitmap image and several short blocks of text. "
       "We use it to test whether rendering works, whether the parser preserves page-level "
       "coordinates, and whether our own reconstruction logic can group words into lines."
   )
   y = height - 105
   for para_idx in range(4):
       tx = c.beginText(60, y)
       tx.setFont("Helvetica", 10)
       tx.setLeading(13)
       for line in textwrap.wrap(f"Block {para_idx + 1}: {dense}", width=92):
           tx.textLine(line)
       c.drawText(tx)
       y -= 70
   c.drawImage(str(DEMO_IMAGE_PATH), 110, height - 510, width=320, height=180, preserveAspectRatio=True)
   c.setStrokeColor(colors.pink)
   c.setLineWidth(2)
   c.roundRect(95, height - 525, 350, 210, 10, stroke=1, fill=0)
   c.setFillColor(colors.black)
   c.setFont("Helvetica-Bold", 12)
   c.drawString(60, height - 570, "Coordinate-aware extraction lets us keep page, text, and position together.")
   c.setFont("Helvetica", 9)
   c.drawString(60, 55, "Page 2: embedded bitmap image and multiple text blocks.")
   c.save()
build_pdf(PDF_PATH)
print("Created PDF:", PDF_PATH)

We generate a small bitmap image and create a custom two-page PDF for testing Docling Parse. We add text blocks, two-column content, vector shapes, table-like content, and an embedded image so the parser has several document elements to process. We use the generated PDF as a controlled input to inspect text extraction, layout structure, rendering, and coordinate-aware parsing.
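
Because we place every text block by hand, we know roughly where each line should land, which gives us something to compare the parser output against. The sketch below recomputes the baseline of each wrapped line from the same kind of inputs the generator uses (start y, leading, wrap width). The helper name `expected_baselines` is hypothetical, and the A4 height of 841.89 points is hard-coded so the snippet runs without ReportLab.

```python
import textwrap

A4_HEIGHT_PT = 841.89  # A4 page height in PDF points (297 mm); hard-coded to avoid a ReportLab import


def expected_baselines(text, wrap_width, y_start, leading):
    """Pair each wrapped line with the baseline y at which it would be drawn.

    Mirrors a text object that starts at y_start and moves down by `leading`
    after every line, in PDF coordinates (origin at the bottom-left corner).
    """
    lines = textwrap.wrap(text, width=wrap_width)
    return [(y_start - i * leading, line) for i, line in enumerate(lines)]


left_para = (
    "The left column contains compact explanatory text. A parser should expose words, "
    "characters, and line-level cells along with coordinates."
)
for y, line in expected_baselines(left_para, wrap_width=42, y_start=A4_HEIGHT_PT - 200, leading=13):
    print(f"{y:7.2f}  {line}")
```

Parsed cell rectangles will not match these baselines exactly, since a cell box also covers ascenders and descenders, but the line spacing should agree with the leading.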

Extracting Word, Character, and Line Cells with Docling Parse

def safe_to_dict(obj, max_depth=2):
   if obj is None:
       return None
   if isinstance(obj, (str, int, float, bool)):
       return obj
   if isinstance(obj, (list, tuple)):
       return [safe_to_dict(x, max_depth=max_depth - 1) for x in obj[:50]]
   if isinstance(obj, dict):
       return {
           str(k): safe_to_dict(v, max_depth=max_depth - 1)
           for k, v in list(obj.items())[:50]
       }
   # Pydantic models expose model_dump(); prefer it when available.
   if hasattr(obj, "model_dump"):
       try:
           return obj.model_dump()
       except Exception:
           pass
   if hasattr(obj, "__dict__") and max_depth > 0:
       try:
           return {
               k: safe_to_dict(v, max_depth=max_depth - 1)
               for k, v in obj.__dict__.items()
               if not k.startswith("_")
           }
       except Exception:
           pass
   return str(obj)
def rect_to_dict(rect):
   d = safe_to_dict(rect)
   if isinstance(d, dict):
       return d
   attrs = {}
   for name in [
       "l", "t", "r", "b",
       "left", "top", "right", "bottom",
       "x0", "y0", "x1", "y1",
       "width", "height"
   ]:
       if hasattr(rect, name):
           try:
               attrs[name] = getattr(rect, name)
           except Exception:
               pass
   return attrs if attrs else {"raw": str(rect)}
def get_text_cell_records(page_no, pred_page, unit_type):
   records = []
   try:
       cells = list(pred_page.iterate_cells(unit_type=unit_type))
   except Exception as e:
       print(f"Could not iterate {unit_type} cells on page {page_no}: {e}")
       return records
   for idx, cell in enumerate(cells):
       text = getattr(cell, "text", "")
       rect = getattr(cell, "rect", None)
       records.append({
           "page": page_no,
           "unit": str(unit_type).split(".")[-1],
           "index": idx,
           "text": text,
           "rect": rect_to_dict(rect),
           "raw_cell": safe_to_dict(cell, max_depth=1),
       })
   return records
def count_possible_resources(pred_page):
   resource_summary = {}
   names = dir(pred_page)
   keywords = ["path", "bitmap", "image", "resource", "line", "rect"]
   for name in names:
       lname = name.lower()
       if any(k in lname for k in keywords) and not name.startswith("_"):
           try:
               value = getattr(pred_page, name)
               if callable(value):
                   continue
               try:
                   resource_summary[name] = len(value)
               except Exception:
                   resource_summary[name] = type(value).__name__
           except Exception:
               pass
   return resource_summary
parser = DoclingPdfParser()
start = time.perf_counter()
pdf_doc = parser.load(path_or_stream=str(PDF_PATH))
load_time = time.perf_counter() - start
print(f"\nLoaded PDF in {load_time:.3f} seconds.")
all_records = []
page_summaries = []
rendered_paths = []
parse_start = time.perf_counter()
for page_no, pred_page in pdf_doc.iterate_pages():
   print(f"\n--- Page {page_no} ---")
   word_records = get_text_cell_records(page_no, pred_page, TextCellUnit.WORD)
   char_records = get_text_cell_records(page_no, pred_page, TextCellUnit.CHAR)
   line_records = get_text_cell_records(page_no, pred_page, TextCellUnit.LINE)
   all_records.extend(word_records)
   all_records.extend(char_records)
   all_records.extend(line_records)
   resource_summary = count_possible_resources(pred_page)
   page_summaries.append({
       "page": page_no,
       "words": len(word_records),
       "chars": len(char_records),
       "lines": len(line_records),
       "possible_resource_attributes": resource_summary,
   })
   print("Words:", len(word_records))
   print("Characters:", len(char_records))
   print("Lines:", len(line_records))
   print("Possible resource attributes:", resource_summary)
   print("\nFirst 20 extracted words:")
   print(" ".join([r["text"] for r in word_records[:20]]))
   for unit_name, unit_type in [
       ("word", TextCellUnit.WORD),
       ("char", TextCellUnit.CHAR),
       ("line", TextCellUnit.LINE),
   ]:
       try:
           img = pred_page.render_as_image(cell_unit=unit_type)
           out_img = OUT_DIR / f"page_{page_no}_{unit_name}_overlay.png"
           img.save(out_img)
           rendered_paths.append(out_img)
           print("Saved rendered overlay:", out_img)
       except Exception as e:
           print(f"Could not render {unit_name} overlay for page {page_no}: {e}")
parse_time = time.perf_counter() - parse_start

We define helper functions to safely convert Docling objects, rectangles, and page resources into readable Python dictionaries. We load the generated PDF with DoclingPdfParser and extract word-level, character-level, and line-level text cells from each page. We also render page overlays for the different text units to visually inspect how Docling Parse detects and maps content on PDF pages.
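
The rectangle attached to each cell may be serialized as four corner points rather than a simple left/top/right/bottom box. As a sketch, assuming the dumped rectangle carries keys named `r_x0`, `r_y0` through `r_x3`, `r_y3` (an assumption about the field names, so check your own JSON output first), the helper below collapses the corners into an axis-aligned bounding box. The name `corners_to_bbox` is hypothetical.

```python
def corners_to_bbox(rect):
    """Collapse a four-corner rectangle dict into (x_min, y_min, x_max, y_max).

    Assumes keys r_x0..r_x3 and r_y0..r_y3. Returns None if any key is missing
    or not numeric, so callers can skip cells without usable geometry.
    """
    try:
        xs = [float(rect[f"r_x{i}"]) for i in range(4)]
        ys = [float(rect[f"r_y{i}"]) for i in range(4)]
    except (KeyError, TypeError, ValueError):
        return None
    return (min(xs), min(ys), max(xs), max(ys))


# Made-up example values, not taken from a real parse.
sample = {
    "r_x0": 60.0, "r_y0": 700.0, "r_x1": 95.5, "r_y1": 700.0,
    "r_x2": 95.5, "r_y2": 711.0, "r_x3": 60.0, "r_y3": 711.0,
}
print(corners_to_bbox(sample))
```

Taking the min and max over all four corners keeps the result valid even when the corners arrive in a different order or the cell is slightly rotated.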

Exporting Structured Outputs and Reconstructing Layout-Aware Text

records_path = OUT_DIR / "docling_parse_cells.json"
with open(records_path, "w", encoding="utf-8") as f:
   json.dump(all_records, f, indent=2, ensure_ascii=False)
summary_path = OUT_DIR / "page_summaries.json"
with open(summary_path, "w", encoding="utf-8") as f:
   json.dump(page_summaries, f, indent=2, ensure_ascii=False)
flat_rows = []
for r in all_records:
   rect = r.get("rect", {})
   row = {
       "page": r["page"],
       "unit": r["unit"],
       "index": r["index"],
       "text": r["text"],
   }
   # Spread rectangle fields into rect_* columns; stringify anything non-scalar.
   if isinstance(rect, dict):
       for k, v in rect.items():
           if isinstance(v, (str, int, float, bool)) or v is None:
               row[f"rect_{k}"] = v
           else:
               row[f"rect_{k}"] = str(v)
   flat_rows.append(row)
df = pd.DataFrame(flat_rows)
csv_path = OUT_DIR / "docling_parse_cells.csv"
df.to_csv(csv_path, index=False)
summary_df = pd.DataFrame(page_summaries)
summary_csv_path = OUT_DIR / "page_summaries.csv"
summary_df.to_csv(summary_csv_path, index=False)
print("\nSaved structured outputs:")
print(records_path)
print(csv_path)
print(summary_path)
print(summary_csv_path)
print("\nPage summary:")
display(summary_df)
print("\nCell dataframe sample:")
display(df.head(20))
def extract_rect_numbers(rect):
   if not isinstance(rect, dict):
       return None
   possible_sets = [
       ("l", "t", "r", "b"),
       ("left", "top", "right", "bottom"),
       ("x0", "y0", "x1", "y1"),
   ]
   for keys in possible_sets:
       if all(k in rect for k in keys):
           try:
               vals = [float(rect[k]) for k in keys]
               return vals
           except Exception:
               pass
   # Fallback: take the first four values that parse as numbers.
   numeric = []
   for v in rect.values():
       try:
           numeric.append(float(v))
       except Exception:
           pass
   if len(numeric) >= 4:
       return numeric[:4]
   return None
word_df = df[df["unit"].str.contains("WORD", case=False, na=False)].copy()
if len(word_df) == 0:
   word_df = df[df["unit"].str.contains("word", case=False, na=False)].copy()
coords = []
for _, row in word_df.iterrows():
   rect_data = {}
   for col in word_df.columns:
       if col.startswith("rect_"):
           rect_data[col.replace("rect_", "")] = row[col]
   nums = extract_rect_numbers(rect_data)
   coords.append(nums)
word_df["coord_numbers"] = coords
word_df = word_df[word_df["coord_numbers"].notna()].copy()
if len(word_df) > 0:
   word_df["x0"] = word_df["coord_numbers"].apply(lambda x: min(x[0], x[2]))
   word_df["x1"] = word_df["coord_numbers"].apply(lambda x: max(x[0], x[2]))
   word_df["y0"] = word_df["coord_numbers"].apply(lambda x: min(x[1], x[3]))
   word_df["y1"] = word_df["coord_numbers"].apply(lambda x: max(x[1], x[3]))
   word_df["y_mid"] = (word_df["y0"] + word_df["y1"]) / 2
   reconstructed_pages = {}
   for page, g in word_df.groupby("page"):
       g = g.sort_values(["y_mid", "x0"]).copy()
       y_values = sorted(g["y_mid"].tolist())
       line_bins = []
       threshold = 8.0
       # Greedily bin words whose vertical midpoints lie within the threshold.
       for y in y_values:
           placed = False
           for line in line_bins:
               if abs(line["center"] - y) <= threshold:
                   line["values"].append(y)
                   line["center"] = sum(line["values"]) / len(line["values"])
                   placed = True
                   break
           if not placed:
               line_bins.append({"center": y, "values": [y]})
       def assign_line(y):
           return min(range(len(line_bins)), key=lambda i: abs(line_bins[i]["center"] - y))
       g["line_id"] = g["y_mid"].apply(assign_line)
       lines = []
       for line_id, lg in g.groupby("line_id"):
           lg = lg.sort_values("x0")
           line_text = " ".join(lg["text"].astype(str).tolist())
           lines.append((lg["y_mid"].mean(), line_text))
       lines = sorted(lines, key=lambda x: x[0])
       reconstructed_text = "\n".join([line for _, line in lines])
       reconstructed_pages[int(page)] = reconstructed_text
   recon_path = OUT_DIR / "layout_aware_reconstructed_text.json"
   with open(recon_path, "w", encoding="utf-8") as f:
       json.dump(reconstructed_pages, f, indent=2, ensure_ascii=False)
   print("\nLayout-aware reconstructed text:")
   for page, text in reconstructed_pages.items():
       print(f"\n===== PAGE {page} =====")
       print(text[:2500])
   print("\nSaved reconstruction:", recon_path)
else:
   print("\nCould not build coordinate-based reconstruction because rectangle coordinates were not exposed in a numeric form.")

We save the extracted parsing results into JSON and CSV files for later analysis. We flatten the parsed records into a Pandas DataFrame and display both the page summary and a sample of the cell-level extraction. We also reconstruct text from coordinate information, which helps us understand how a layout-aware reading order can be derived from word positions.
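
The reconstruction step is easier to reason about on a tiny input. The sketch below is a simplified, pandas-free variant of the same idea rather than the tutorial code itself: it greedily bins words by vertical midpoint, then sorts each bin from left to right. It takes plain `(text, x0, y_mid)` tuples, and the `origin_bottom_left` flag is our own addition, since PDF coordinates often grow upward and lines then need to be emitted from the largest y to the smallest. The 8-point threshold matches the value used in the tutorial and may need tuning for other font sizes.

```python
def group_words_into_lines(words, threshold=8.0, origin_bottom_left=True):
    """Group (text, x0, y_mid) tuples into text lines in reading order."""
    # Each bin holds a running-mean "center", the y values seen, and (x0, text) pairs.
    bins = []
    for text, x0, y_mid in sorted(words, key=lambda w: w[2]):
        for b in bins:
            if abs(b["center"] - y_mid) <= threshold:
                b["words"].append((x0, text))
                b["ys"].append(y_mid)
                b["center"] = sum(b["ys"]) / len(b["ys"])
                break
        else:
            # No existing bin was close enough, so start a new line.
            bins.append({"center": y_mid, "ys": [y_mid], "words": [(x0, text)]})
    # Top of the page first: largest y for a bottom-left origin, smallest otherwise.
    bins.sort(key=lambda b: b["center"], reverse=origin_bottom_left)
    return [" ".join(t for _, t in sorted(b["words"])) for b in bins]


# Made-up coordinates for illustration.
words = [
    ("world", 100.0, 700.5), ("Hello", 60.0, 700.0),
    ("line", 110.0, 687.0), ("Second", 60.0, 687.2),
]
print("\n".join(group_words_into_lines(words)))
```

Note that words from two columns at the same height end up on one line with this approach, so a column-aware pipeline would first split the words by x position and then group each column separately.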

Benchmarking Threaded Parsing and Checking CLI Availability

print("\nAttempting threaded parsing benchmark...")
threaded_results = []
threaded_available = True
try:
   from docling_parse.pdf_parser import DoclingThreadedPdfParser, ThreadedPdfParserConfig
   from docling_parse.pdf_parsers import DecodePageConfig
   parser_config = ThreadedPdfParserConfig(
       loglevel="fatal",
       threads=4,
       max_concurrent_results=32,
   )
   decode_config = DecodePageConfig()
   threaded_parser = DoclingThreadedPdfParser(
       parser_config=parser_config,
       decode_config=decode_config,
   )
   t0 = time.perf_counter()
   doc_key = threaded_parser.load(str(PDF_PATH))
   page_count = threaded_parser.page_count(doc_key)
   print("Threaded doc key:", doc_key)
   print("Threaded page count:", page_count)
   for result in threaded_parser.iterate_results():
       item = {
           "doc_key": str(getattr(result, "doc_key", "")),
           "page_number": getattr(result, "page_number", None),
           "success": getattr(result, "success", None),
           "error_message": getattr(result, "error_message", None),
       }
       if getattr(result, "success", False):
           seg_page = result.get_page()
           timings = result.get_timings()
           item["word_count"] = len(getattr(seg_page, "word_cells", []))
           try:
               item["total_time"] = timings.total()
           except Exception:
               item["total_time"] = str(timings)
       threaded_results.append(item)
   threaded_time = time.perf_counter() - t0
except Exception as e:
   # The threaded API is optional; fall back gracefully if it is missing or fails.
   threaded_available = False
   threaded_time = None
   print("Threaded parser is not available or failed in this environment.")
   print("Error:", repr(e))
if threaded_available:
   threaded_path = OUT_DIR / "threaded_parse_results.json"
   with open(threaded_path, "w", encoding="utf-8") as f:
       json.dump(threaded_results, f, indent=2, ensure_ascii=False)
   threaded_df = pd.DataFrame(threaded_results)
   print("\nThreaded parsing results:")
   display(threaded_df)
   print("Saved threaded results:", threaded_path)
benchmark = {
   "standard_load_time_seconds": load_time,
   "standard_iterate_parse_time_seconds": parse_time,
   "threaded_total_time_seconds": threaded_time,
   "total_cells_extracted": len(all_records),
   "output_dir": str(OUT_DIR),
}
benchmark_path = OUT_DIR / "benchmark.json"
with open(benchmark_path, "w", encoding="utf-8") as f:
   json.dump(benchmark, f, indent=2)
print("\nBenchmark:")
print(json.dumps(benchmark, indent=2))
print("\nChecking CLI availability...")
cli = shutil.which("docling-parse")
print("docling-parse CLI:", cli)
if cli:
   cli_result = subprocess.run(
       "docling-parse -h",
       shell=True,
       text=True,
       capture_output=True,
   )
   print(cli_result.stdout[:1000])
else:
   print("CLI was not found on PATH, but the Python API worked.")
print("\nGenerated files:")
for p in sorted(OUT_DIR.glob("*")):
   print(p)
print("\nRendering saved overlay images in notebook...")
for img_path in rendered_paths[:6]:
   try:
       img = Image.open(img_path)
       plt.figure(figsize=(9, 12))
       plt.imshow(img)
       plt.axis("off")
       plt.title(img_path.name)
       plt.show()
   except Exception as e:
       print("Could not display:", img_path, e)
print("\nTutorial complete.")
print("Main outputs are saved in:", OUT_DIR)

We test threaded parsing to see whether parallel page processing is available in the current environment. We store the threaded parsing results, generate a benchmark summary, check whether the Docling Parse CLI is available, and list all generated output files. Finally, we display the rendered overlay images inside the notebook so we can visually verify the parsing quality.
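
The benchmark relies on repeated `time.perf_counter()` bookkeeping, which is easy to get wrong once more stages are added. One possible way to tidy this up is a small context manager that writes each duration into a dictionary. This is a sketch using only the standard library; the name `timed` and the `sleep_demo` label are ours, and the `time.sleep` call is a stand-in for a real parsing step.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the wall-clock duration of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the block raises, so failed stages still get a timing.
        results[label] = time.perf_counter() - start


timings = {}
with timed("sleep_demo", timings):
    time.sleep(0.01)  # stand-in for a load call or a page loop
print({k: round(v, 4) for k, v in timings.items()})
```

A single run on a two-page PDF is noisy, so treat the numbers as a rough sanity check rather than a performance comparison.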

Conclusion

In conclusion, we now have a complete document parsing pipeline that goes beyond simple text extraction. We created a test PDF, parsed its text at multiple granularities, inspected spatial metadata, exported structured outputs, reconstructed layout-aware text, and benchmarked threaded parsing where available. This gives us a base for building document intelligence systems that need both content and layout information. We can now extend this workflow to real-world PDFs, connect the extracted data with downstream NLP or RAG pipelines, and use Docling Parse as a lightweight but powerful component for structured PDF understanding.



The post How to Build a Parsing Pipeline with Docling Parse for Layout-Aware Document Intelligence appeared first on MarkTechPost.
