OCRmyPDF Tutorial: Convert Scanned Documents into Searchable PDF/A Files with Sidecar Text Extraction and Batch Processing
In this tutorial, we build an advanced, self-contained OCRmyPDF workflow. We begin by installing the required system and Python dependencies, then create a synthetic image-only PDF to stand in for a scan, so we can test OCR without relying on external files. From there, we use OCRmyPDF’s public Python API to convert scanned documents into searchable PDFs, generate PDF/A outputs, extract sidecar text, validate the results, compare file sizes, tune Tesseract settings, clean noisy scans, handle already-OCRed files, process images with DPI hints, run OCR in memory, and batch-process multiple PDFs. Through this workflow, we see how OCRmyPDF can serve as a practical document digitization pipeline for archival, search, extraction, and automated processing tasks.
Installing OCRmyPDF System Dependencies
import io
import os
import re
import sys
import time
import shutil
import logging
import textwrap
import subprocess
from pathlib import Path
INSTALL_JBIG2 = True
def sh(cmd: str, check: bool = True) -> int:
    """Run a shell command, echo it, and show the tail of its output."""
    print(f" $ {cmd}")
    r = subprocess.run(cmd, shell=True, text=True,
                       stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    if r.stdout and r.stdout.strip():
        # Show only the last 12 lines so noisy installers stay readable.
        for ln in r.stdout.strip().splitlines()[-12:]:
            print(" " + ln)
    if check and r.returncode != 0:
        raise RuntimeError(f"Command failed ({r.returncode}): {cmd}")
    return r.returncode
def install_dependencies() -> None:
    """Install OCRmyPDF's system + Python dependencies for Colab/Ubuntu."""
    apt_pkgs = (
        "tesseract-ocr tesseract-ocr-eng tesseract-ocr-osd "
        "tesseract-ocr-deu tesseract-ocr-fra "
        "ghostscript unpaper pngquant poppler-utils qpdf"
    )
    sh("apt-get update -qq", check=False)
    sh(f"DEBIAN_FRONTEND=noninteractive apt-get install -y -qq {apt_pkgs}")
    sh(f'"{sys.executable}" -m pip install -q --upgrade ocrmypdf img2pdf "pillow<12"')
    if INSTALL_JBIG2 and shutil.which("jbig2") is None:
        try:
            # jbig2enc is optional, so it is built from source on a best-effort basis.
            build_pkgs = ("autoconf automake libtool pkg-config "
                          "libleptonica-dev zlib1g-dev build-essential git")
            sh(f"DEBIAN_FRONTEND=noninteractive apt-get install -y -qq {build_pkgs}")
            sh("rm -rf /tmp/jbig2enc && "
               "git clone -q https://github.com/agl/jbig2enc.git /tmp/jbig2enc")
            sh("cd /tmp/jbig2enc && ./autogen.sh >/dev/null 2>&1 && "
               "./configure >/dev/null 2>&1 && make -j2 >/dev/null 2>&1 && "
               "make install >/dev/null 2>&1 && ldconfig")
            print(" jbig2enc:",
                  "installed" if shutil.which("jbig2") else "built, but binary not on PATH")
        except Exception as e:
            print(" jbig2enc build skipped (optional):", e)
def ensure_installed() -> None:
    have_tools = bool(shutil.which("tesseract") and shutil.which("gs"))
    try:
        import ocrmypdf
        import img2pdf
        from PIL import Image
        have_py = True
    except Exception:
        have_py = False
    if have_tools and have_py:
        print("Dependencies already present; skipping install.")
    else:
        print("Installing dependencies (first run can take a few minutes)...")
        install_dependencies()
ensure_installed()
We set up the complete OCRmyPDF environment for Google Colab by importing the required standard libraries and defining the installation workflow. We install system tools such as Tesseract, Ghostscript, unpaper, pngquant, poppler, and qpdf, along with Python packages like OCRmyPDF, img2pdf, and Pillow. We also optionally build jbig2enc so that advanced PDF optimization can produce smaller outputs for scanned documents.
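As a quick sanity check before running the installer, the external tools can be probed from Python. The sketch below is illustrative only: `tool_report` and the required/optional split are assumptions made for this example (the tutorial's own check only looks for `tesseract` and `gs`), and the `which` parameter exists so the lookup can be swapped out.

```python
import shutil

# Executable names come from the packages the tutorial installs.
# Treating only tesseract and gs as "required" mirrors ensure_installed();
# the optional list is an assumption for illustration.
REQUIRED_TOOLS = ("tesseract", "gs")
OPTIONAL_TOOLS = ("unpaper", "pngquant", "jbig2", "pdftotext")


def tool_report(required=REQUIRED_TOOLS, optional=OPTIONAL_TOOLS, which=shutil.which):
    """Return (missing_required, missing_optional) as lists of tool names."""
    missing_required = [t for t in required if which(t) is None]
    missing_optional = [t for t in optional if which(t) is None]
    return missing_required, missing_optional


if __name__ == "__main__":
    need, nice = tool_report()
    print("missing required:", need or "none")
    print("missing optional:", nice or "none")
```

An empty first list suggests the core OCR path should run; entries in the second list only mean that features such as `--clean` or JBIG2 compression may be unavailable.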
Loading OCRmyPDF and Building Synthetic Scans
def _purge(*prefixes):
    # Drop cached modules so a freshly installed version is picked up on import.
    for name in [m for m in list(sys.modules)
                 if any(m == p or m.startswith(p + ".") for p in prefixes)]:
        del sys.modules[name]
def _load_ocrmypdf():
    _purge("PIL", "ocrmypdf")
    import ocrmypdf
    return ocrmypdf
try:
    ocrmypdf = _load_ocrmypdf()
except ImportError as e:
    if "_Ink" in str(e) or "PIL" in str(e):
        print("Repairing an incompatible Pillow (reinstalling pillow<12)...")
        sh(f'"{sys.executable}" -m pip install -q --force-reinstall "pillow<12"')
        try:
            ocrmypdf = _load_ocrmypdf()
            print("Pillow repaired; continuing without a restart.")
        except Exception:
            raise RuntimeError(
                "Pillow is still incompatible in this session. Use the Colab menu: "
                "Runtime > Restart session, then run this cell again."
            )
    else:
        raise
from ocrmypdf.exceptions import (
ExitCode,
PriorOcrFoundError,
EncryptedPdfError,
MissingDependencyError,
TaggedPDFError,
DigitalSignatureError,
DpiError,
InputFileError,
UnsupportedImageFormatError,
)
from ocrmypdf.helpers import check_pdf
from ocrmypdf.pdfa import file_claims_pdfa
import img2pdf
from PIL import Image, ImageDraw, ImageFont, ImageFilter
logging.basicConfig(level=logging.WARNING, format="%(levelname)s: %(message)s")
logging.getLogger("ocrmypdf").setLevel(logging.WARNING)
logging.getLogger("pdfminer").setLevel(logging.ERROR)
logging.getLogger("PIL").setLevel(logging.WARNING)
SAMPLE_TEXT_PAGES = [
"Optical Character Recognition, commonly abbreviated as OCR, is the "
"process of converting images of typed or printed text into machine "
"encoded text. This page was generated as a synthetic scan so that the "
"OCRmyPDF pipeline has something realistic to recognize and search.",
"On 14 March 2026 the archive contained 1,482 pages across 37 folders. "
"Roughly 92 percent of those pages were scanned at 200 to 300 dots per "
"inch. The remaining 8 percent were skewed and required deskewing before "
"any reliable recognition was possible.",
"After OCRmyPDF finishes, the output is a searchable PDF/A file. You can "
"select text, copy it, and run full text search across thousands of "
"documents. The original image resolution is preserved while a hidden "
"text layer is placed accurately underneath the page image.",
]
def _find_font():
    for cand in (
        "/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf",
        "/usr/share/fonts/truetype/liberation/LiberationSans-Regular.ttf",
    ):
        if os.path.exists(cand):
            return cand
    return None
_FONT_PATH = _find_font()
FONT = ImageFont.truetype(_FONT_PATH, 40) if _FONT_PATH else ImageFont.load_default()
def _add_speckle(img, n=6000, dark=60):
    """Sprinkle a light dusting of dark specks to mimic scanner noise (motivates --clean)."""
    import random
    px = img.load()
    w, h = img.size
    for _ in range(n):
        px[random.randint(0, w - 1), random.randint(0, h - 1)] = random.randint(0, dark)
    return img
def render_page(text, skew=False):
    """Render one A4 page (1654x2339 px ≈ 200 DPI) of dark text on white."""
    W, H = 1654, 2339
    img = Image.new("L", (W, H), 255)
    draw = ImageDraw.Draw(img)
    draw.multiline_text((150, 180), textwrap.fill(text, width=58),
                        fill=25, font=FONT, spacing=18)
    if skew:
        # Simulate a poor scan: rotate slightly, blur, then add speckle noise.
        img = img.rotate(6, resample=Image.BICUBIC, expand=False, fillcolor=255)
        img = img.filter(ImageFilter.GaussianBlur(0.6))
        img = _add_speckle(img)
    return img
def build_scanned_pdf(pdf_path: Path, pages_text, skew_index=1):
    """Render pages to PNGs and wrap them losslessly into an image-only PDF."""
    pngs = []
    for i, text in enumerate(pages_text):
        img = render_page(text, skew=(i == skew_index))
        p = pdf_path.parent / f"_pg_{pdf_path.stem}_{i}.png"
        img.save(p, format="PNG", dpi=(200, 200))
        pngs.append(str(p))
    with open(pdf_path, "wb") as f:
        f.write(img2pdf.convert(pngs))
    # The temporary PNGs are no longer needed once they are embedded in the PDF.
    for p in pngs:
        os.remove(p)
    return pdf_path
def do_ocr(input_file, output_file, **kw):
    """Wrapper around ocrmypdf.ocr() that disables the progress bar and times it."""
    kw.setdefault("progress_bar", False)
    t0 = time.perf_counter()
    rc = ocrmypdf.ocr(input_file, output_file, **kw)
    return rc, time.perf_counter() - t0
def tokens(s: str):
    return re.findall(r"[a-z0-9]+", s.lower())
def kb(path) -> str:
    return f"{Path(path).stat().st_size / 1024:,.1f} KB"
def banner(title: str):
    line = "─" * 74
    print(f"\n{line}\n {title}\n{line}")
We safely load OCRmyPDF and repair Pillow compatibility issues if they appear in the Colab runtime. We import OCRmyPDF exceptions, PDF validation helpers, img2pdf, and Pillow utilities used throughout the tutorial. We also define the sample document text and helper functions for rendering synthetic scanned pages, adding scanner-like noise, building image-only PDFs, timing OCR runs, tokenizing text, formatting file sizes, and printing section banners.
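The page size hard-coded in `render_page` (1654x2339 px) follows from unit arithmetic: an A4 sheet is 210 mm by 297 mm, and one inch is 25.4 mm. The short sketch below works through that conversion; the helper names `page_pixels` and `page_inches` are made up for this illustration and are not part of OCRmyPDF or Pillow.

```python
A4_MM = (210.0, 297.0)   # ISO A4 width and height in millimetres
MM_PER_INCH = 25.4


def page_pixels(dpi, size_mm=A4_MM):
    """Pixel dimensions of a page of the given physical size at `dpi`."""
    return tuple(round(mm / MM_PER_INCH * dpi) for mm in size_mm)


def page_inches(pixels, dpi):
    """Physical size in inches implied by pixel dimensions and a DPI value."""
    return tuple(px / dpi for px in pixels)


if __name__ == "__main__":
    print(page_pixels(200))                 # (1654, 2339), the size used by render_page
    print(page_pixels(300))                 # (2480, 3508)
    print(page_inches((1654, 2339), 200))   # roughly 8.27 x 11.70 inches
```

The same relationship is why a DPI hint matters for bare images: without it, the pixel dimensions alone do not determine the physical page size.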
Running Basic and Advanced PDF/A OCR
banner("0 · Environment")
print("Python :", sys.version.split()[0])
print("ocrmypdf:", ocrmypdf.__version__)
sh("tesseract --version", check=False)
sh("gs --version", check=False)
sh("tesseract --list-langs", check=False)
print("unpaper :", shutil.which("unpaper"))
print("pngquant:", shutil.which("pngquant"))
print("jbig2 :", shutil.which("jbig2"), "(optional encoder)")
WORK = Path("/content/ocrmypdf_demo")
try:
    WORK.mkdir(parents=True, exist_ok=True)
except Exception:
    # Outside Colab, /content may not be writable; fall back to the current directory.
    WORK = Path.cwd() / "ocrmypdf_demo"
    WORK.mkdir(parents=True, exist_ok=True)
print("Workdir :", WORK)
banner("1 · Build a synthetic image-only 'scanned' PDF")
input_pdf = WORK / "scanned_input.pdf"
build_scanned_pdf(input_pdf, SAMPLE_TEXT_PAGES, skew_index=1)
print(f"Created {input_pdf.name} ({kb(input_pdf)}, 3 pages; page 2 is skewed + speckled)")
print("This PDF has NO text layer yet; selecting/searching it returns nothing.")
banner("2 · Basic OCR (deskew + auto-rotate)")
out_basic = WORK / "out_basic.pdf"
rc, dt = do_ocr(
    input_pdf, out_basic,
    language=["eng"],
    deskew=True,
    rotate_pages=True,
)
print(f"Exit code: {rc.name} ({int(rc)}) in {dt:.1f}s -> {out_basic.name} ({kb(out_basic)})")
banner("3 · Advanced OCR (PDF/A-2, --optimize 3, sidecar, metadata)")
out_adv = WORK / "out_advanced.pdf"
sidecar = WORK / "ocr_text.txt"
rc, dt = do_ocr(
    input_pdf, out_adv,
    language=["eng"],
    deskew=True,
    rotate_pages=True,
    optimize=3,
    jpg_quality=80,
    png_quality=80,
    output_type="pdfa-2",
    sidecar=sidecar,
    title="OCRmyPDF Colab Tutorial",
    author="Tutorial",
    subject="Demonstration of OCRmyPDF",
    keywords="ocr, pdf, tesseract, ocrmypdf",
)
print(f"Exit code: {rc.name} ({int(rc)}) in {dt:.1f}s -> {out_adv.name} ({kb(out_adv)})")
sh(f'pdfinfo "{out_adv}" | grep -E "Title|Author|Subject|Keywords|Pages"', check=False)
We begin the main tutorial by printing the OCR environment details, including Python, OCRmyPDF, Tesseract, Ghostscript, installed languages, and optional optimization tools. We create a working directory and generate a synthetic scanned PDF that has no searchable text layer. We then run both a basic OCR workflow and an advanced OCR workflow with PDF/A output, image optimization, sidecar text generation, and document metadata.
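For readers who prefer the command line, the advanced run above maps onto `ocrmypdf` flags fairly directly. The following sketch only assembles an argument list and does not execute anything; `build_ocrmypdf_argv` is a hypothetical helper written for this illustration, and it covers just a handful of options rather than the full CLI.

```python
import shlex


def build_ocrmypdf_argv(input_pdf, output_pdf, *, languages=("eng",),
                        deskew=False, rotate_pages=False, optimize=None,
                        output_type=None, sidecar=None, title=None):
    """Assemble an ocrmypdf command line as a list of arguments (not executed)."""
    # Multiple languages are joined with "+", e.g. "-l eng+deu".
    argv = ["ocrmypdf", "-l", "+".join(languages)]
    if deskew:
        argv.append("--deskew")
    if rotate_pages:
        argv.append("--rotate-pages")
    if optimize is not None:
        argv += ["--optimize", str(optimize)]
    if output_type:
        argv += ["--output-type", output_type]
    if sidecar:
        argv += ["--sidecar", str(sidecar)]
    if title:
        argv += ["--title", title]
    # Input and output paths come last.
    argv += [str(input_pdf), str(output_pdf)]
    return argv


if __name__ == "__main__":
    argv = build_ocrmypdf_argv(
        "scanned_input.pdf", "out_advanced.pdf",
        deskew=True, rotate_pages=True, optimize=3,
        output_type="pdfa-2", sidecar="ocr_text.txt",
        title="OCRmyPDF Colab Tutorial",
    )
    print(shlex.join(argv))  # shell-quoted form, ready to paste into a terminal
```

Building a list rather than a single string keeps paths with spaces intact if the result is later handed to `subprocess.run` without `shell=True`.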
Validating Searchability and OCR Word-Recall
banner("4 · Prove searchability + measure OCR word-recall")
ocr_text = sidecar.read_text(errors="ignore")
print("Sidecar text (first 300 chars):\n" + ocr_text[:300].strip())
embedded = WORK / "embedded_text.txt"
sh(f'pdftotext "{out_adv}" "{embedded}"', check=False)
print(f"\npdftotext extracted {len(embedded.read_text(errors='ignore').split())} "
      f"words from the OUTPUT PDF (the input had 0).")
src = tokens(" ".join(SAMPLE_TEXT_PAGES))
found = set(tokens(ocr_text))
# Recall = share of source tokens that appear anywhere in the OCR output.
recall = sum(1 for w in src if w in found) / max(1, len(src))
print(f"OCR word-recall vs. source: {recall * 100:.1f}% ({len(src)} source words)")
banner("5 · Validate output + size comparison")
print("check_pdf (valid PDF structure):", check_pdf(out_adv))
print("file_claims_pdfa (PDF/A marker):", file_claims_pdfa(out_adv))
print(f"input : {kb(input_pdf)}")
print(f"basic : {kb(out_basic)}")
print(f"advanced : {kb(out_adv)} (PDF/A-2 + image optimisation)")
banner("6 · Modes & exceptions: skip-text / redo-ocr / force-ocr")
try:
    do_ocr(out_adv, WORK / "should_fail.pdf", language=["eng"])
    print("Unexpected: no exception was raised.")
except PriorOcrFoundError as e:
    print(f"Caught PriorOcrFoundError (exit code {e.exit_code}): the PDF already "
          f"has text. Choose a mode to override:")
rc, _ = do_ocr(out_adv, WORK / "out_skiptext.pdf", language=["eng"], skip_text=True)
print(f" --skip-text -> {rc.name}")
rc, _ = do_ocr(out_adv, WORK / "out_redo.pdf", language=["eng"], redo_ocr=True)
print(f" --redo-ocr -> {rc.name}")
rc, _ = do_ocr(out_adv, WORK / "out_force.pdf", language=["eng"], force_ocr=True)
print(f" --force-ocr -> {rc.name}")
We prove that OCR has made the scanned PDF searchable by reading the sidecar text and extracting embedded text from the output PDF using pdftotext. We compare the recovered OCR text against the known source text to calculate a simple word-recall score. We then validate the PDF structure, check the PDF/A marker, compare file sizes, and demonstrate how OCRmyPDF handles files that already contain OCR text using skip-text, redo-OCR, and force-OCR modes.
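The word-recall score is easier to reason about on a tiny input. The sketch below repackages the tutorial's tokenizer and recall expression as a standalone function (the name `word_recall` is introduced here for illustration) and applies it to a one-line example with a single misread number.

```python
import re


def tokens(s):
    """Lowercase the text and split it into runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", s.lower())


def word_recall(source, ocr):
    """Fraction of source tokens that occur anywhere in the OCR text."""
    src = tokens(source)
    found = set(tokens(ocr))
    if not src:
        return 0.0
    return sum(1 for w in src if w in found) / len(src)


if __name__ == "__main__":
    source = "The archive contained 1,482 pages"
    ocr = "The archive contained 1,4B2 pages"   # "8" misread as "B"
    # Source tokens: the, archive, contained, 1, 482, pages (six in total).
    # Only "482" is missing from the OCR tokens, so recall is 5/6.
    print(f"{word_recall(source, ocr):.3f}")
```

Note that this measure ignores word order and repeated words, and the comma splits "1,482" into two tokens, so it is a rough health check rather than a character-level accuracy metric.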
Tuning, Cleaning, and In-Memory OCR
banner("7 · Tesseract engine tuning (--oem / --psm)")
rc, dt = do_ocr(
    input_pdf, WORK / "out_tuned.pdf",
    language=["eng"],
    tesseract_oem=1,
    tesseract_pagesegmode=3,
    output_type="pdf",
)
print(f"Tuned run -> {rc.name} in {dt:.1f}s")
banner("8 · Image cleaning with unpaper (--clean / --clean-final)")
try:
    rc, dt = do_ocr(
        input_pdf, WORK / "out_cleaned.pdf",
        language=["eng"], deskew=True,
        clean=True, clean_final=True, output_type="pdf",
    )
    print(f"Cleaned run -> {rc.name} in {dt:.1f}s")
except Exception as e:
    print("Cleaning step skipped (unpaper issue):", type(e).__name__, e)
banner("9 · Auto-orientation (OSD) on a 90°-rotated page (--rotate-pages)")
try:
    rot_png = WORK / "_rot.png"
    # Render an upright page, turn it 90 degrees, and save it as the test input.
    (render_page(SAMPLE_TEXT_PAGES[0])
        .rotate(90, expand=True, fillcolor=255)
        .save(rot_png, format="PNG", dpi=(200, 200)))
    rot_pdf = WORK / "rotated_input.pdf"
    with open(rot_pdf, "wb") as f:
        f.write(img2pdf.convert([str(rot_png)]))
    os.remove(rot_png)
    rot_side = WORK / "rotated_text.txt"
    rc, dt = do_ocr(
        rot_pdf, WORK / "out_rotated_fixed.pdf",
        language=["eng"], rotate_pages=True, sidecar=rot_side, output_type="pdf",
    )
    n = len(rot_side.read_text(errors="ignore").split())
    print(f"OSD corrected the page; recovered {n} words -> {rc.name} in {dt:.1f}s")
except Exception as e:
    print("Auto-orientation demo skipped:", type(e).__name__, e)
banner("10 · OCR a single image (image_dpi hint)")
single_png = WORK / "single_scan.png"
# Saved without DPI metadata on purpose, so image_dpi has to supply it.
render_page(SAMPLE_TEXT_PAGES[2]).save(single_png, format="PNG")
rc, dt = do_ocr(
    single_png, WORK / "out_from_image.pdf",
    language=["eng"],
    image_dpi=200,
    output_type="pdf",
)
print(f"Image -> searchable PDF: {rc.name} in {dt:.1f}s")
banner("11 · In-memory OCR with BytesIO streams")
in_io = io.BytesIO(input_pdf.read_bytes())
out_io = io.BytesIO()
ocrmypdf.ocr(in_io, out_io, language=["eng"], output_type="pdf", progress_bar=False)
out_bytes = out_io.getvalue()
(WORK / "out_in_memory.pdf").write_bytes(out_bytes)
print(f"OCR'd entirely in RAM -> {len(out_bytes):,} bytes written to out_in_memory.pdf")
We experiment with Tesseract engine tuning by setting the OCR engine mode and page segmentation mode directly through OCRmyPDF. We then use unpaper-based image cleaning to improve noisy scanned pages and optionally embed the cleaned image into the final output. We also test automatic page orientation correction, convert a single image into a searchable PDF using an explicit DPI hint, and run OCR entirely in memory using BytesIO streams.
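The numeric values passed as `tesseract_oem` and `tesseract_pagesegmode` are Tesseract's own mode numbers, which can be hard to remember. The lookup below is a small reading aid, assuming the meanings printed by `tesseract --help-extra`; it lists only some of the page segmentation modes, and `describe_tuning` is a hypothetical helper, not an OCRmyPDF function.

```python
# OCR engine modes (--oem) as documented by Tesseract.
OEM_MODES = {
    0: "legacy engine only",
    1: "neural-net LSTM engine only",
    2: "legacy + LSTM engines",
    3: "default, based on what is available",
}

# A subset of page segmentation modes (--psm); Tesseract defines more.
PSM_MODES = {
    0: "orientation and script detection (OSD) only",
    1: "automatic page segmentation with OSD",
    3: "fully automatic page segmentation without OSD",
    4: "single column of text of variable sizes",
    6: "single uniform block of text",
    7: "single text line",
    11: "sparse text, in no particular order",
}


def describe_tuning(oem, psm):
    """Return a human-readable summary of an (oem, psm) pair."""
    oem_label = OEM_MODES.get(oem, "unknown engine mode")
    psm_label = PSM_MODES.get(psm, "mode not listed in this sketch")
    return f"oem={oem}: {oem_label}; psm={psm}: {psm_label}"


if __name__ == "__main__":
    # The combination used in the tuned run above.
    print(describe_tuning(1, 3))
```

For ordinary full-page scans like the synthetic ones here, mode 3 is already Tesseract's default page segmentation, so the tuned run mainly pins the LSTM engine; other modes tend to matter for receipts, single lines, or sparse labels.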
Batch OCR and the Typed OcrOptions API
banner("12 · Batch-process a folder of PDFs")
batch_in = WORK / "batch_in"
batch_out = WORK / "batch_out"
batch_in.mkdir(exist_ok=True)
batch_out.mkdir(exist_ok=True)
build_scanned_pdf(batch_in / "invoice_001.pdf",
                  [SAMPLE_TEXT_PAGES[0], SAMPLE_TEXT_PAGES[1]], skew_index=1)
build_scanned_pdf(batch_in / "memo_002.pdf",
                  [SAMPLE_TEXT_PAGES[2]], skew_index=-1)
print(f"{'file':<20}{'result':<14}{'time':<8}size")
for src_pdf in sorted(batch_in.glob("*.pdf")):
    dst = batch_out / src_pdf.name
    try:
        rc, dt = do_ocr(src_pdf, dst, language=["eng"],
                        deskew=True, output_type="pdfa")
        print(f"{src_pdf.name:<20}{rc.name:<14}{dt:<8.1f}{kb(dst)}")
    except Exception as e:
        # One bad file should not stop the rest of the batch.
        print(f"{src_pdf.name:<20}{type(e).__name__:<14}{'-':<8}-")
banner("13 · New-style typed OcrOptions API (v17+)")
try:
    from ocrmypdf._options import OcrOptions
    opts = OcrOptions(
        input_file=str(input_pdf),
        output_file=str(WORK / "out_options.pdf"),
        languages=["eng"],
        deskew=True,
        rotate_pages=True,
        output_type="pdfa",
        progress_bar=False,
    )
    rc = ocrmypdf.ocr(opts)
    print(f"OcrOptions run -> {rc.name} ({int(rc)})")
except Exception as e:
    print("OcrOptions API not available in this version:", type(e).__name__, e)
banner("14 · Results")
produced = sorted(p for p in WORK.glob("*.pdf"))
for p in produced:
    print(f" {p.name:<26}{kb(p)}")
for p in sorted(batch_out.glob("*.pdf")):
    print(f" batch_out/{p.name:<16}{kb(p)}")
print(f"\nAll files are in: {WORK}")
try:
    from google.colab import files
    for p in [out_adv, out_basic, sidecar, embedded]:
        if Path(p).exists():
            files.download(str(p))
except Exception as e:
    print("(Colab download unavailable; open the files from the panel instead.)", e)
print("\nDone.\n")
We scale the workflow from a single file to folder-level batch processing by creating multiple synthetic input PDFs and OCRing each one into an output directory. We then try the newer typed OcrOptions API, which lets us pass validated OCR settings as a structured options object. Finally, we list all generated PDF outputs, including batch results, show the working directory path, and download key files.
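When a batch is re-run regularly, it can help to skip files whose output is already up to date and to collect per-file outcomes instead of printing them. The sketch below shows one way to structure that, under stated assumptions: `batch_process` is a hypothetical helper, the modification-time check is a design choice added here (the tutorial's loop always reprocesses), and the OCR step is passed in as a callable so the function itself does not depend on OCRmyPDF.

```python
from pathlib import Path


def batch_process(in_dir, out_dir, process, pattern="*.pdf"):
    """Apply process(src, dst) to each matching file; return {name: status}."""
    in_dir, out_dir = Path(in_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    results = {}
    for src in sorted(in_dir.glob(pattern)):
        dst = out_dir / src.name
        # Skip when an output exists and is at least as new as the input.
        if dst.exists() and dst.stat().st_mtime >= src.stat().st_mtime:
            results[src.name] = "skipped"
            continue
        try:
            process(src, dst)
            results[src.name] = "ok"
        except Exception as exc:
            # Record the failure and carry on with the remaining files.
            results[src.name] = f"failed: {type(exc).__name__}"
    return results
```

In the tutorial's setting, the callable could be something like `lambda s, d: do_ocr(s, d, language=["eng"], deskew=True, output_type="pdfa")`, which reuses the `do_ocr` wrapper defined earlier.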
Conclusion
In conclusion, we have a complete OCRmyPDF pipeline that goes far beyond basic scanned-PDF conversion. We created realistic scanned inputs, applied OCR with deskewing and rotation correction, generated optimized PDF/A files, verified embedded text, measured OCR recall, validated PDF structure, and experimented with multiple processing modes, including skip-text, redo-OCR, and force-OCR. We also explored practical production features, including image cleaning, Tesseract engine tuning, in-memory processing, and folder-level batch OCR.
The post OCRmyPDF Tutorial: Convert Scanned Documents into Searchable PDF/A Files with Sidecar Text Extraction and Batch Processing appeared first on MarkTechPost.
