Designing a Schema-Guided Invoice Intelligence Pipeline with lift-pdf for Accounts-Payable Extraction, Validation, and Ledger Generation
In this tutorial, we build an end-to-end accounts-payable extraction pipeline with lift-pdf, using synthetic invoice PDFs as controlled test documents and a structured JSON schema as the target output format. Instead of treating invoice parsing as a simple OCR task, we frame it as schema-guided document understanding: we generate realistic invoices, define fields such as vendor identity, billing party, PO number, line items, tax, total amount, balance due, and payment status, and then ask the model to extract these values directly from the rendered PDF layout. We also include extraction traps that appear in real finance workflows, such as distinguishing bill-to from ship-to, separating the subtotal from the after-tax total, returning null for absent values, and correctly marking partially paid invoices as unpaid when a balance remains. Through GPU-aware model loading, optional 4-bit quantization, PDF generation and extraction, scoring, and ledger construction, we turn this tutorial into a compact but realistic demonstration of document intelligence for invoice mining.
N_DOCS = 3
FORCE_FULL_PRECISION = False
FORCE_4BIT = False
SHOW_FIRST_PAGE = True
RUN_ON_REAL_PDF = False
REAL_PDF_URL = ""
REAL_PDF_PAGES = "0-1"
PIN_PILLOW = True
PILLOW_VERSION = "11.3.0"
import os, sys, subprocess, json, re, time, warnings
warnings.filterwarnings("ignore")
os.environ["TOKENIZERS_PARALLELISM"] = "false"
def pip(*pkgs, upgrade=False):
    """Install without invoking a shell (so '[hf]' is never glob-expanded)."""
    args = [sys.executable, "-m", "pip", "install", "-q"] + (["-U"] if upgrade else []) + list(pkgs)
    print("   pip install", *pkgs)
    subprocess.run(args, check=False)
print("STEP 1/7 · Installing lift + light dependencies (first run is the slow one)…")
pip("reportlab", "pypdfium2", "pandas", "matplotlib")
pip("lift-pdf[hf]")
pip("bitsandbytes", "accelerate", upgrade=True)
if PIN_PILLOW:
    pip(f"pillow=={PILLOW_VERSION}")
    # If an older Pillow was imported before the pin, the process must restart to pick up the new one.
    if "PIL" in sys.modules:
        import PIL
        if getattr(PIL, "__version__", "") != PILLOW_VERSION:
            print(f"   Pinned Pillow {PILLOW_VERSION} on disk, but a stale "
                  f"{getattr(PIL, '__version__', '?')} is loaded in memory; restarting runtime.")
            print("   Just re-run the cell(s) after Colab reconnects.")
            os.kill(os.getpid(), 9)
print("   …install finished.\n")
import torch
We begin by defining the runtime controls that determine how many invoices we process, whether we force 4-bit or full-precision loading, whether we preview the generated PDF, and whether we later test a real invoice. We install the core dependencies for PDF generation, rendering, tabular analysis, plotting, and lift-pdf inference. We also pin Pillow to a fixed version because the tutorial works around a known Colab compatibility issue among Pillow, torchvision, and Transformers. This setup gives us a reproducible environment before we load any model or generate any document.
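The installer's docstring mentions avoiding the shell so that `[hf]` is never glob-expanded. The sketch below isolates that idea. `build_pip_args` is a hypothetical helper, not part of the tutorial; it mirrors how the argument list is assembled and runs nothing.

```python
import sys

def build_pip_args(*pkgs, upgrade=False):
    """Hypothetical helper: assemble the argv list for a shell-free pip call."""
    args = [sys.executable, "-m", "pip", "install", "-q"]
    if upgrade:
        args.append("-U")
    # Each package spec stays a single argv element, so the '[' and ']'
    # in "lift-pdf[hf]" are never interpreted by a shell.
    return args + list(pkgs)

args = build_pip_args("lift-pdf[hf]")
print(args[-1])
```

Passing such a list to `subprocess.run` without `shell=True` hands every element to pip unchanged, which is why the extras syntax needs no quoting.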
def detect_gpu():
    if not torch.cuda.is_available():
        raise SystemExit(
            "\n✗ No CUDA GPU found. In Colab: Runtime ▸ Change runtime type ▸ GPU "
            "(A100 is best; L4/T4 also work).\n"
        )
    p = torch.cuda.get_device_properties(0)
    cc = torch.cuda.get_device_capability(0)
    return p.name, p.total_memory / 1e9, cc
def enable_4bit(compute_dtype):
    """Load lift's weights in 4-bit NF4 whatever transformers Auto* class it uses internally."""
    import inspect, functools, transformers
    from transformers import BitsAndBytesConfig
    bnb = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=compute_dtype,
    )
    def patch(cls):
        # Fetch the raw from_pretrained so it can be wrapped without triggering descriptor binding.
        try:
            cm = inspect.getattr_static(cls, "from_pretrained")
            orig = cm.__func__ if isinstance(cm, (classmethod, staticmethod)) else cm
        except Exception:
            return
        @functools.wraps(orig)
        def inner(cls_, *args, **kwargs):
            # Inject the quantization config and device map only if the caller did not set them.
            kwargs.setdefault("quantization_config", bnb)
            kwargs.setdefault("device_map", {"": 0})
            model = orig(cls_, *args, **kwargs)
            # Quantized weights are already on the GPU; turn later .to()/.cuda() calls into no-ops.
            try:
                model.to = lambda *a, **k: model
                model.cuda = lambda *a, **k: model
            except Exception:
                pass
            return model
        cls.from_pretrained = classmethod(inner)
    for name in ["AutoModelForImageTextToText", "AutoModelForMultimodalLM",
                 "AutoModelForVision2Seq", "AutoModelForCausalLM", "AutoModel"]:
        c = getattr(transformers, name, None)
        if c is not None:
            patch(c)
    try:
        from transformers.modeling_utils import PreTrainedModel
        patch(PreTrainedModel)
    except Exception:
        pass
print("STEP 2/7 · Preparing the model backend…")
gpu_name, vram, cc = detect_gpu()
use_4bit = FORCE_4BIT or (vram < 34 and not FORCE_FULL_PRECISION)
compute_dtype = torch.bfloat16 if cc[0] >= 8 else torch.float16
print(f"   GPU: {gpu_name} | ~{vram:.0f} GB | compute capability {cc[0]}.{cc[1]}")
print(f"   Load mode: {'4-bit NF4' if use_4bit else 'full bf16'} (compute dtype {compute_dtype})")
os.environ.setdefault("TORCH_DEVICE", "cuda:0")
os.environ.setdefault("MODEL_CHECKPOINT", "datalab-to/lift")
if use_4bit:
    enable_4bit(compute_dtype)
from lift import extract
from lift.model import InferenceManager
print("   Loading lift weights (≈20 GB download on first run)…")
_t = time.time()
MODEL = InferenceManager(method="hf")
print(f"   ✓ model ready in {time.time() - _t:.0f}s\n")
def run_lift(pdf_path, schema, page_range=None):
    # Reuse the single loaded MODEL for every call; page_range is passed only when given.
    kw = {"model": MODEL}
    if page_range:
        kw["page_range"] = page_range
    result = extract(pdf_path, schema, **kw)
    return getattr(result, "extraction", None)
We prepare the GPU-aware inference backend and decide whether the model should run in full precision or 4-bit NF4 quantization based on available VRAM. We patch the Hugging Face model-loading path so lift can transparently load the checkpoint with a BitsAndBytes quantization configuration when needed. We initialize the InferenceManager once and reuse it across all invoices, avoiding repeated model-loading overhead. Finally, we wrap lift's extract() in a small helper so each PDF can be mined with the same schema and an optional page range.
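The load-mode decision can be read as a small pure function. The sketch below restates the tutorial's thresholds (under 34 GB selects 4-bit, compute capability 8 or higher selects bfloat16) without needing a GPU. `choose_load_mode` is a hypothetical name, and the inputs are illustrative values rather than measurements from real hardware.

```python
def choose_load_mode(vram_gb, cc_major, force_4bit=False, force_full=False):
    """Hypothetical helper: return (use_4bit, dtype_name) from the tutorial's thresholds."""
    # 4-bit when explicitly forced, or when the card has under 34 GB
    # and full precision was not explicitly requested.
    use_4bit = force_4bit or (vram_gb < 34 and not force_full)
    # bfloat16 for compute capability 8.x and above, float16 otherwise.
    dtype_name = "bfloat16" if cc_major >= 8 else "float16"
    return use_4bit, dtype_name

# Illustrative inputs: a small pre-Ampere card and a large Ampere-class card.
print(choose_load_mode(16, 7))
print(choose_load_mode(80, 8))
print(choose_load_mode(80, 8, force_4bit=True))
```

Note that with this expression `force_4bit` takes precedence when both flags are set, which matches the order of the terms in the tutorial's `use_4bit` line.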
DOCS = [
dict(
invoice_number="INV-2026-0412",
invoice_date="2026-05-04", due_date="2026-06-03",
vendor_name="Cloudworks Inc.",
vendor_address="500 Market St, Suite 900, San Francisco, CA 94105, USA",
bill_to_name="Acme Robotics LLC",
bill_to_address="12 Foundry Rd, Pittsburgh, PA 15222, USA",
ship_to_name="Acme Robotics — Warehouse 4",
ship_to_address="88 Dockside Blvd, Newark, NJ 07114, USA",
po_number=None,
discount_amount=None,
currency_code="USD", currency_symbol="$",
tax_rate=0.085,
amount_paid=0.00,
line_items=[
("Cloud Compute — Standard tier (monthly)", 3, 240.00),
("Object Storage — 2 TB", 1, 46.00),
("Priority Support add-on", 1, 99.00),
],
notes="Payment due within 30 days. Late payments accrue 1.5% monthly interest.",
),
dict(
invoice_number="INV-ND-2026-118",
invoice_date="2026-04-18", due_date="2026-05-18",
vendor_name="Nordic Design Studio Oy",
vendor_address="Eteläranta 12, 00130 Helsinki, Finland",
bill_to_name="Helsinki Media Oy",
bill_to_address="Mannerheimintie 4, 00100 Helsinki, Finland",
ship_to_name=None, ship_to_address=None,
po_number="PO-HM-5589",
discount_amount=785.00,
currency_code="EUR", currency_symbol="€",
tax_rate=0.24,
amount_paid=8760.60,
line_items=[
("Brand identity design package", 1, 4200.00),
("Web UI design — 12 screens", 12, 180.00),
("Custom illustration set", 1, 850.00),
("Design-system documentation", 1, 640.00),
],
notes="Paid in full — thank you. All amounts in EUR.",
),
dict(
invoice_number="INV-BR-4471",
invoice_date="2026-06-01", due_date="2026-07-15",
vendor_name="BuildRight Contractors Inc.",
vendor_address="740 Industrial Way, Austin, TX 78744, USA",
bill_to_name="Sunrise Property Group",
bill_to_address="9 Lakeview Terrace, Austin, TX 78703, USA",
ship_to_name="Sunrise Property Group — Lot 14 site office",
ship_to_address="Parcel 14, Mesa Ridge Development, Austin, TX 78737, USA",
po_number="PO-SPG-2211",
discount_amount=None,
currency_code="USD", currency_symbol="$",
tax_rate=0.07,
amount_paid=15000.00,
line_items=[
("Site preparation and grading", 1, 18500.00),
("Foundation concrete pour (Phase 1)", 1, 27400.00),
],
notes="A 15,000 USD deposit has been received. Remaining balance due by the date above.",
),
][:N_DOCS]
def compute(d):
    """Derive every money figure once, so PDF text and ground truth are guaranteed identical."""
    items = [(desc, q, up, round(q * up, 2)) for (desc, q, up) in d["line_items"]]
    subtotal = round(sum(t for *_, t in items), 2)
    disc = d.get("discount_amount")
    # Tax is applied after the discount, on the taxable amount.
    taxable = round(subtotal - (disc or 0.0), 2)
    tax = round(taxable * d["tax_rate"], 2)
    total = round(taxable + tax, 2)
    paid = round(d.get("amount_paid", 0.0), 2)
    balance = round(total - paid, 2)
    return dict(items=items, subtotal=subtotal, discount=disc, tax=tax,
                total=total, amount_paid=paid, balance=balance, is_paid=(balance <= 0.005))
def ground_truth(d):
    """Reshape raw inputs + computed totals into the exact JSON shape our schema asks for."""
    c = compute(d)
    return {
        "invoice_number": d["invoice_number"],
        "invoice_date": d["invoice_date"],
        "due_date": d["due_date"],
        "vendor": {"name": d["vendor_name"], "address": d["vendor_address"]},
        "customer_name": d["bill_to_name"],
        "purchase_order_number": d.get("po_number"),
        "currency": d["currency_code"],
        "line_items": [{"description": desc, "quantity": q,
                        "unit_price": up, "line_total": t} for (desc, q, up, t) in c["items"]],
        "subtotal": c["subtotal"],
        "discount_amount": c["discount"],
        "tax_amount": c["tax"],
        "total_amount": c["total"],
        "amount_paid": c["amount_paid"],
        "balance_due": c["balance"],
        "is_paid": c["is_paid"],
    }
We define a controlled synthetic invoice corpus that mimics realistic accounts-payable documents across different vendors, currencies, payment states, and invoice layouts. Each invoice includes raw business fields such as vendor details, bill-to and ship-to parties, PO numbers, discounts, taxes, deposits, and line items. We then compute derived financial values such as subtotal, tax, total, balance due, and paid status from the raw invoice data. This ensures the rendered PDF and the ground-truth JSON remain mathematically consistent.
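As a worked example of the derived figures, the sketch below recomputes the totals for the Nordic Design Studio invoice by hand. It deliberately swaps in `decimal.Decimal` with half-up rounding instead of the tutorial's float `round`, so the arithmetic is exact; that substitution and the helper name `to_cents` are assumptions made for illustration only.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")

def to_cents(x):
    """Round a Decimal to two places, half up (illustrative helper)."""
    return x.quantize(CENT, rounding=ROUND_HALF_UP)

# (quantity, unit price) pairs copied from the Nordic Design Studio invoice.
line_items = [(1, "4200.00"), (12, "180.00"), (1, "850.00"), (1, "640.00")]

subtotal = sum((to_cents(q * Decimal(up)) for q, up in line_items), Decimal("0"))
discount = Decimal("785.00")
taxable = subtotal - discount                  # the discount is removed before tax
tax = to_cents(taxable * Decimal("0.24"))      # 24% tax rate from the invoice data
total = taxable + tax
amount_paid = Decimal("8760.60")
balance = total - amount_paid

print(subtotal, taxable, tax, total, balance)
```

The subtotal is 7,850.00, the taxable amount 7,065.00, the tax 1,695.60, and the total 8,760.60, which equals the amount paid. The balance is therefore zero, and this is the one invoice in the corpus that should be marked as paid.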
def render_pdf(d, path):
    """Draw a realistic one-page invoice: header, meta, bill/ship, line items, totals, payment."""
    from reportlab.lib.pagesizes import LETTER
    from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
    from reportlab.lib.units import inch
    from reportlab.lib import colors
    from reportlab.platypus import (SimpleDocTemplate, Paragraph, Spacer,
                                    Table, TableStyle)
    c = compute(d)
    sym = d["currency_symbol"]
    def money(x): return f"{sym}{x:,.2f}"
    ss = getSampleStyleSheet()
    H1 = ParagraphStyle("H1", parent=ss["Title"], fontSize=18, leading=22, spaceAfter=2)
    SMALL = ParagraphStyle("SM", parent=ss["Normal"], fontSize=8.5, textColor=colors.grey, leading=11)
    LBL = ParagraphStyle("LBL", parent=ss["Normal"], fontSize=8.5, textColor=colors.HexColor("#2b3a67"),
                         spaceAfter=1, fontName="Helvetica-Bold")
    BODY = ParagraphStyle("BODY", parent=ss["Normal"], fontSize=9.5, leading=13)
    RIGHT = ParagraphStyle("R", parent=ss["Normal"], fontSize=16, leading=18, alignment=2,
                           textColor=colors.HexColor("#2b3a67"), fontName="Helvetica-Bold")
    story = []
    # Header: vendor block on the left, "INVOICE" and the invoice number on the right.
    head = Table([[
        [Paragraph(d["vendor_name"], H1), Paragraph(d["vendor_address"], SMALL)],
        [Paragraph("INVOICE", RIGHT),
         Paragraph(f"{d['invoice_number']}", ParagraphStyle('n', parent=SMALL, alignment=2, fontSize=9.5))],
    ]], colWidths=[4.2 * inch, 2.8 * inch])
    head.setStyle(TableStyle([("VALIGN", (0, 0), (-1, -1), "TOP")]))
    story += [head, Spacer(1, 10)]
    # Metadata: dates, currency, and the PO number only when one exists.
    meta_rows = [["Invoice date", d["invoice_date"], "Due date", d["due_date"]]]
    if d.get("po_number"):
        meta_rows.append(["PO number", d["po_number"], "Currency", d["currency_code"]])
    else:
        meta_rows.append(["Currency", d["currency_code"], "", ""])
    meta = Table(meta_rows, colWidths=[1.3 * inch, 2.2 * inch, 1.3 * inch, 2.2 * inch])
    meta.setStyle(TableStyle([
        ("FONTSIZE", (0, 0), (-1, -1), 9),
        ("TEXTCOLOR", (0, 0), (0, -1), colors.HexColor("#2b3a67")),
        ("TEXTCOLOR", (2, 0), (2, -1), colors.HexColor("#2b3a67")),
        ("FONTNAME", (0, 0), (0, -1), "Helvetica-Bold"),
        ("FONTNAME", (2, 0), (2, -1), "Helvetica-Bold"),
        ("BOTTOMPADDING", (0, 0), (-1, -1), 3), ("TOPPADDING", (0, 0), (-1, -1), 3)]))
    story += [meta, Spacer(1, 12)]
    # Bill-to and ship-to blocks side by side (the bill-to vs ship-to trap).
    bill = [Paragraph("BILL TO", LBL), Paragraph(d["bill_to_name"], BODY),
            Paragraph(d["bill_to_address"], SMALL)]
    if d.get("ship_to_name"):
        ship = [Paragraph("SHIP TO", LBL), Paragraph(d["ship_to_name"], BODY),
                Paragraph(d["ship_to_address"], SMALL)]
    else:
        ship = [Paragraph("SHIP TO", LBL), Paragraph("Same as billing address", SMALL)]
    parties = Table([[bill, ship]], colWidths=[3.5 * inch, 3.5 * inch])
    parties.setStyle(TableStyle([("VALIGN", (0, 0), (-1, -1), "TOP"),
                                 ("LEFTPADDING", (0, 0), (-1, -1), 0)]))
    story += [parties, Spacer(1, 14)]
    # Line-item table.
    rows = [["Description", "Qty", "Unit price", "Amount"]]
    for (desc, q, up, t) in c["items"]:
        rows.append([desc, str(q), money(up), money(t)])
    items_tbl = Table(rows, colWidths=[3.5 * inch, 0.7 * inch, 1.4 * inch, 1.4 * inch])
    items_tbl.setStyle(TableStyle([
        ("BACKGROUND", (0, 0), (-1, 0), colors.HexColor("#2b3a67")),
        ("TEXTCOLOR", (0, 0), (-1, 0), colors.white),
        ("FONTSIZE", (0, 0), (-1, -1), 9.5),
        ("ALIGN", (1, 0), (-1, -1), "RIGHT"),
        ("GRID", (0, 0), (-1, -1), 0.4, colors.HexColor("#cdd3e6")),
        ("ROWBACKGROUNDS", (0, 1), (-1, -1), [colors.white, colors.HexColor("#eef1f8")]),
        ("LEFTPADDING", (0, 0), (-1, -1), 8), ("TOPPADDING", (0, 0), (-1, -1), 5),
        ("BOTTOMPADDING", (0, 0), (-1, -1), 5)]))
    story += [items_tbl, Spacer(1, 10)]
    # Totals block: subtotal, optional discount, tax, grand total (the subtotal vs total trap).
    tot_rows = [["Subtotal", money(c["subtotal"])]]
    if c["discount"]:
        tot_rows.append(["Discount", "-" + money(c["discount"])])
    tot_rows.append([f"Tax ({d['tax_rate']*100:.1f}%)", money(c["tax"])])
    tot_rows.append(["TOTAL", money(c["total"])])
    totals = Table(tot_rows, colWidths=[1.6 * inch, 1.4 * inch], hAlign="RIGHT")
    totals.setStyle(TableStyle([
        ("FONTSIZE", (0, 0), (-1, -1), 10),
        ("ALIGN", (0, 0), (-1, -1), "RIGHT"),
        ("LINEABOVE", (0, -1), (-1, -1), 1.0, colors.HexColor("#2b3a67")),
        ("FONTNAME", (0, -1), (-1, -1), "Helvetica-Bold"),
        ("TEXTCOLOR", (0, -1), (-1, -1), colors.HexColor("#2b3a67")),
        ("TOPPADDING", (0, 0), (-1, -1), 3), ("BOTTOMPADDING", (0, 0), (-1, -1), 3)]))
    story += [totals, Spacer(1, 8)]
    # Payment block: amount paid and balance due, colored by payment status.
    pay_rows = [["Amount paid", money(c["amount_paid"])],
                ["Balance due", money(c["balance"])]]
    pay = Table(pay_rows, colWidths=[1.6 * inch, 1.4 * inch], hAlign="RIGHT")
    due_color = colors.HexColor("#1b7a3d") if c["is_paid"] else colors.HexColor("#7a2e2e")
    pay.setStyle(TableStyle([
        ("FONTSIZE", (0, 0), (-1, -1), 10),
        ("ALIGN", (0, 0), (-1, -1), "RIGHT"),
        ("FONTNAME", (0, 1), (-1, 1), "Helvetica-Bold"),
        ("TEXTCOLOR", (0, 1), (-1, 1), due_color),
        ("TOPPADDING", (0, 0), (-1, -1), 2), ("BOTTOMPADDING", (0, 0), (-1, -1), 2)]))
    status = "PAID IN FULL" if c["is_paid"] else "BALANCE DUE"
    story += [pay, Spacer(1, 6),
              Paragraph(f"<b>Status:</b> {status}", BODY), Spacer(1, 16),
              Paragraph("Notes", LBL), Paragraph(d["notes"], BODY)]
    SimpleDocTemplate(path, pagesize=LETTER,
                      topMargin=0.7 * inch, bottomMargin=0.7 * inch,
                      leftMargin=0.8 * inch, rightMargin=0.8 * inch).build(story)
print("STEP 3/7 · Generating synthetic invoice PDFs…")
CORPUS = []
for i, d in enumerate(DOCS):
    path = f"/content/invoice_{i}.pdf" if os.path.isdir("/content") else f"invoice_{i}.pdf"
    render_pdf(d, path)
    CORPUS.append((d, ground_truth(d), path))
    print(f"   ✓ {os.path.basename(path)}: {d['vendor_name']} → {d['bill_to_name']}")
print()
if SHOW_FIRST_PAGE:
    try:
        import pypdfium2 as pdfium, matplotlib.pyplot as plt
        pg = pdfium.PdfDocument(CORPUS[0][2])[0]
        img = pg.render(scale=2.0).to_pil()
        plt.figure(figsize=(6.4, 8.3)); plt.imshow(img); plt.axis("off")
        plt.title("What lift reads: page 1 of invoice_0.pdf", fontsize=10); plt.show()
    except Exception as e:
        print("   (page preview skipped:", e, ")\n")
We render each synthetic invoice into a realistic one-page PDF using ReportLab, including headers, invoice metadata, billing and shipping blocks, line-item tables, totals, payment status, and notes. We intentionally preserve layout elements that make invoice extraction difficult, such as separate bill-to and ship-to sections and subtotal versus total fields. We then generate the PDF corpus and optionally preview the first page using pypdfium2 and Matplotlib. This step creates the actual visual documents that lift reads during extraction.
SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string", "description": "The invoice's unique identifier / number"},
        "invoice_date": {"type": "string", "description": "Date the invoice was issued (as printed)"},
        "due_date": {"type": "string", "description": "Date payment is due"},
        "vendor": {
            "type": "object",
            "description": "The party that ISSUED the invoice (the seller / supplier)",
            "properties": {
                "name": {"type": "string"},
                "address": {"type": "string"},
            }},
        "customer_name": {"type": "string",
                          "description": "The party the invoice is billed TO (the 'Bill To' party), "
                                         "not the vendor, and not the 'Ship To' party if it differs"},
        "purchase_order_number": {"type": "string",
                                  "description": "The PO number referenced on the invoice. "
                                                 "Return null if no purchase-order number appears"},
        "currency": {"type": "string",
                     "description": "ISO 4217 currency code of the amounts, e.g. USD or EUR"},
        "line_items": {
            "type": "array",
            "description": "Every billed line item, in order",
            "items": {"type": "object", "properties": {
                "description": {"type": "string"},
                "quantity": {"type": "number"},
                "unit_price": {"type": "number"},
                "line_total": {"type": "number", "description": "quantity × unit_price for this line"},
            }}},
        "subtotal": {"type": "number", "description": "Sum of line totals BEFORE tax and discount"},
        "discount_amount": {"type": "number",
                            "description": "Total discount applied. Return null if no discount is shown"},
        "tax_amount": {"type": "number", "description": "Total tax / VAT charged"},
        "total_amount": {"type": "number",
                         "description": "The grand total the customer owes, AFTER tax and any discount; "
                                        "NOT the pre-tax subtotal and NOT the tax line"},
        "amount_paid": {"type": "number", "description": "Amount already paid (deposits included)"},
        "balance_due": {"type": "number", "description": "Outstanding balance still owed"},
        "is_paid": {"type": "boolean",
                    "description": "true ONLY if the balance due is zero. A partial payment or "
                                   "deposit with a remaining balance does NOT count as paid"},
    },
    "required": ["invoice_number", "total_amount", "vendor"],
}
def _norm(s):
    # Case-fold, collapse whitespace runs, and drop trailing/leading punctuation before comparing strings.
    return re.sub(r"\s+", " ", str(s).strip().lower()).strip(" .,:;/")
def _num(x):
    # Parse "1,234.50", "$99" or "8.5%" into a float; return None for non-numeric values.
    try: return float(str(x).replace("%", "").replace(",", "").replace("$", "").replace("€", "").strip())
    except Exception: return None
def leaf_equal(gt, pr):
    if gt is None and pr is None: return True
    if gt is None or pr is None: return False
    if isinstance(gt, bool) or isinstance(pr, bool): return bool(gt) == bool(pr)
    a, b = _num(gt), _num(pr)
    if a is not None and b is not None:
        # Numbers match within a 0.5% relative tolerance (absolute tolerance when the prediction is 0).
        return abs(a - b) < 1e-6 if b == 0 else abs(a - b) / max(abs(a), abs(b)) < 5e-3
    return _norm(gt) == _norm(pr)
def flatten(o, prefix=""):
    # Turn nested dicts/lists into {"vendor.name": ..., "line_items[0].quantity": ...}.
    out = {}
    if isinstance(o, dict):
        for k, v in o.items():
            out.update(flatten(v, f"{prefix}.{k}" if prefix else k))
    elif isinstance(o, list):
        for i, v in enumerate(o):
            out.update(flatten(v, f"{prefix}[{i}]"))
    else:
        out[prefix] = o
    return out
def rating(gt, pred):
    # Score every ground-truth leaf; a null ground truth is satisfied by a missing or null prediction.
    fg, fp = flatten(gt), flatten(pred or {})
    rows, correct = [], 0
    for key, gv in fg.items():
        present = key in fp
        pv = fp.get(key)
        ok = (gv is None and (not present or pv is None)) or (present and leaf_equal(gv, pv))
        correct += int(ok)
        rows.append((key, gv, (pv if present else None), ok))
    return (correct / len(fg) if fg else 0.0), rows
We define the JSON extraction schema that tells lift exactly which invoice fields to recover and how to interpret ambiguous values. The schema uses field descriptions to guide the model toward the bill-to customer, the after-tax total amount, nullable PO and discount fields, and the correct payment-status logic. We also implement normalization, numeric parsing, recursive flattening, and field-level comparison utilities. These scoring functions let us compare lift's predicted JSON against the known ground truth, with tolerance for differences in numeric formatting.
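To make the flattening and the numeric tolerance concrete, here is a small standalone sketch. It reimplements the flattening idea and a 0.5% relative comparison under the illustrative names `flatten_paths`, `to_number`, and `close_enough`; these are assumptions for the example and are separate from the tutorial's own helpers.

```python
def flatten_paths(obj, prefix=""):
    """Flatten nested dicts/lists into {path: leaf} with dotted keys and [i] indices."""
    out = {}
    if isinstance(obj, dict):
        for k, v in obj.items():
            out.update(flatten_paths(v, f"{prefix}.{k}" if prefix else k))
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            out.update(flatten_paths(v, f"{prefix}[{i}]"))
    else:
        out[prefix] = obj
    return out

def to_number(x):
    """Parse values such as '$1,234.50' into a float; return None if not numeric."""
    try:
        return float(str(x).replace(",", "").replace("$", "").replace("€", "").strip())
    except ValueError:
        return None

def close_enough(a, b, rel_tol=5e-3):
    """Relative comparison; equal values (including 0 and 0) short-circuit to True."""
    if a == b:
        return True
    return abs(a - b) / max(abs(a), abs(b)) < rel_tol

truth = {"vendor": {"name": "Cloudworks Inc."},
         "line_items": [{"quantity": 3, "unit_price": 240.0}]}
flat = flatten_paths(truth)
print(sorted(flat))
print(close_enough(to_number("$1,234.50"), 1234.5))
```

Each leaf becomes one scored field, so an invoice with more line items contributes more fields to its accuracy figure. One consequence of this design is that an empty list flattens to no keys at all, so a ground truth with zero line items would not penalize spurious predicted items.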
print("STEP 4/7 · Extracting with lift and scoring against ground truth…\n")
outcomes = []
for i, (src, gt, path) in enumerate(CORPUS):
    t0 = time.time()
    pred = run_lift(path, SCHEMA)
    dt = time.time() - t0
    acc, rows = rating(gt, pred)
    outcomes.append(dict(src=src, gt=gt, pred=pred, acc=acc, rows=rows, seconds=dt))
    print(f"   invoice {i} · {src['vendor_name']:<28} field accuracy {acc*100:5.1f}%   ({dt:.1f}s)")
r0 = outcomes[0]
print("\n" + "=" * 90)
print(f"DETAILED VIEW · invoice 0 · {r0['src']['vendor_name']} → {r0['src']['bill_to_name']}")
print("=" * 90)
print("Raw JSON lift returned (guaranteed to match the schema shape):\n")
print(json.dumps(r0["pred"], indent=2, ensure_ascii=False))
import pandas as pd
pd.set_option("display.max_colwidth", 46)
pd.set_option("display.width", 120)
grade = pd.DataFrame([{"field": k,
                       "ground_truth": ("∅ null" if g is None else g),
                       "lift_predicted": ("∅ null" if p is None else p),
                       "✓": "✓" if ok else "✗"}
                      for (k, g, p, ok) in r0["rows"]])
print("\nField-by-field grade:\n")
print(grade.to_string(index=False))
print("\nWhat to look for:")
print("  • total_amount should be the AFTER-TAX grand total, not the subtotal (the distractor test).")
print("  • customer_name should be the BILL-TO party, not the different SHIP-TO warehouse.")
print("  • purchase_order_number and discount_amount should be ∅ null: invoice 0 has neither.")
print("  • on invoice 2, is_paid must be False: a $15,000 deposit is shown but a balance remains.")
print("\n" + "=" * 90)
We run lift across every generated invoice, collect the extracted JSON, measure runtime, and calculate field-level accuracy against the ground truth. We then inspect the first invoice in detail by printing the raw model output and a field-by-field grading table. This diagnostic view helps us verify whether the model handles the most important extraction traps correctly, including null fields, bill-to versus ship-to selection, and total amount disambiguation. We use this section as the main evaluation checkpoint for the tutorial.
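Scoring needs ground truth, which real invoices lack. One complementary check, sketched here as an assumption rather than as part of the tutorial, is to test whether an extraction is internally consistent: line totals should sum to the subtotal, the total should equal subtotal minus discount plus tax, and `is_paid` should agree with the balance. `check_invoice_math` is a hypothetical helper that works on a dict shaped like the schema.

```python
def check_invoice_math(inv, tol=0.01):
    """Hypothetical helper: list the arithmetic inconsistencies in one extraction."""
    problems = []
    lines = inv.get("line_items") or []
    # Treat missing or null amounts as zero so nullable fields do not raise.
    subtotal = inv.get("subtotal") or 0.0
    discount = inv.get("discount_amount") or 0.0
    tax = inv.get("tax_amount") or 0.0
    total = inv.get("total_amount") or 0.0
    paid = inv.get("amount_paid") or 0.0
    balance = inv.get("balance_due") or 0.0

    if abs(sum(li.get("line_total") or 0.0 for li in lines) - subtotal) > tol:
        problems.append("subtotal != sum of line totals")
    if abs((subtotal - discount + tax) - total) > tol:
        problems.append("total != subtotal - discount + tax")
    if abs((total - paid) - balance) > tol:
        problems.append("balance != total - paid")
    if bool(inv.get("is_paid")) != (balance <= tol):
        problems.append("is_paid disagrees with balance")
    return problems

# Figures from the BuildRight invoice: a deposit was paid but a balance remains.
good = {
    "line_items": [{"line_total": 18500.00}, {"line_total": 27400.00}],
    "subtotal": 45900.00, "discount_amount": None, "tax_amount": 3213.00,
    "total_amount": 49113.00, "amount_paid": 15000.00,
    "balance_due": 34113.00, "is_paid": False,
}
# The same invoice with the subtotal mistakenly returned as the total.
bad = dict(good, total_amount=45900.00)

print(check_invoice_math(good))
print(check_invoice_math(bad))
```

A check like this catches the subtotal-versus-total distractor even without labels, because the wrong total breaks two identities at once. It cannot catch errors that remain self-consistent, such as a wrong customer name.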
print("STEP 5/7 · Assembling the extractions into a queryable accounts-payable ledger")
print("=" * 90)
def g(d, path, default=None):
    # Safe nested lookup: g(pred, "vendor.name") returns default if any level is missing or null.
    cur = d
    for key in path.split("."):
        if isinstance(cur, dict) and cur.get(key) is not None:
            cur = cur[key]
        else:
            return default
    return cur
kb = pd.DataFrame([{
    "invoice": g(r["pred"], "invoice_number"),
    "vendor": g(r["pred"], "vendor.name"),
    "customer": g(r["pred"], "customer_name"),
    "ccy": g(r["pred"], "currency"),
    "total": g(r["pred"], "total_amount"),
    "paid": g(r["pred"], "amount_paid"),
    "balance": g(r["pred"], "balance_due"),
    "is_paid": g(r["pred"], "is_paid"),
    "items": len(g(r["pred"], "line_items", []) or []),
    "po": g(r["pred"], "purchase_order_number"),
    "field_acc": round(r["acc"], 3),
} for r in outcomes])
print("\nAccounts-payable ledger (one row per mined invoice):\n")
print(kb.to_string(index=False))
print("\nExample query: OUTSTANDING invoices (not fully paid, largest balance first):\n")
owed = kb[kb["is_paid"] != True].sort_values("balance", ascending=False)
print(owed.to_string(index=False) if len(owed) else "   (everything is paid)\n")
try:
    total_owed = sum((r or 0) for r in kb.loc[kb["is_paid"] != True, "balance"])
    print(f"\nTotal outstanding across the batch: {total_owed:,.2f} (mixed currencies; group by ccy in practice)")
except Exception:
    pass
overall = sum(r["acc"] for r in outcomes) / len(outcomes)
print(f"\nSTEP 6/7 · Overall field accuracy across {len(outcomes)} invoices: {overall*100:.1f}%")
print("   Datalab report lift at ~90.2% field accuracy on their 225-doc benchmark.")
try:
    import matplotlib.pyplot as plt
    labels = [r["src"]["vendor_name"].split()[0] for r in outcomes]
    accs = [r["acc"] * 100 for r in outcomes]
    plt.figure(figsize=(7, 3.6))
    bars = plt.bar(labels, accs, color="#2b3a67")
    plt.axhline(90.2, ls="--", color="#7a2e2e", lw=1.4, label="lift benchmark 90.2%")
    for b, a in zip(bars, accs):
        plt.text(b.get_x() + b.get_width()/2, a + 1, f"{a:.0f}%", ha="center", fontsize=9)
    plt.ylim(0, 108); plt.ylabel("Field accuracy %")
    plt.title("Per-invoice extraction accuracy on the synthetic corpus")
    plt.legend(fontsize=8); plt.tight_layout(); plt.show()
except Exception as e:
    print("   (chart skipped:", e, ")")
if RUN_ON_REAL_PDF and REAL_PDF_URL:
    print("\n" + "=" * 90)
    print(f"STEP 7/7 · Bonus: extracting from a REAL invoice ({REAL_PDF_URL})")
    print("=" * 90)
    try:
        import urllib.request
        real_path = "/content/real_invoice.pdf" if os.path.isdir("/content") else "real_invoice.pdf"
        urllib.request.urlretrieve(REAL_PDF_URL, real_path)
        pred_real = run_lift(real_path, SCHEMA, page_range=REAL_PDF_PAGES)
        print("\nExtraction (no ground truth; real invoices vary wildly in layout):\n")
        print(json.dumps(pred_real, indent=2, ensure_ascii=False))
        print("\nTip: real invoices differ hugely by vendor. Tighten the field `description`s and use "
              "page_range to point lift at the page that carries the totals block.")
    except Exception as e:
        print("   real-PDF pass failed:", e)
else:
    print("\nSTEP 7/7 · skipped (set RUN_ON_REAL_PDF = True and REAL_PDF_URL to mine your own invoice).")
print("\nDone. You now have: schema-valid invoice extractions, a scored grade, and an AP ledger.")
print("   Next: swap in your own invoice PDFs + tweak SCHEMA, or reuse MODEL across thousands of files.")
We convert the extracted invoice data into a compact accounts-payable ledger using pandas, with one row per mined invoice. We include operational fields such as invoice number, vendor, customer, currency, total, amount paid, balance due, payment status, item count, PO number, and extraction accuracy. We then query the ledger for outstanding invoices and calculate the total unpaid balance across the batch. Finally, we visualize per-invoice accuracy and optionally apply the same schema to a real invoice PDF when a URL is provided.
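Because the batch mixes USD and EUR, a single summed balance is not meaningful. The sketch below shows one way to total outstanding balances per currency using only the standard library. `outstanding_by_currency` is a hypothetical helper, and the rows are illustrative: the third invoice is made up for the example.

```python
from collections import defaultdict

def outstanding_by_currency(rows):
    """Hypothetical helper: sum unpaid balances per currency code."""
    totals = defaultdict(float)
    for row in rows:
        # Only rows explicitly marked paid are skipped; a missing flag counts as unpaid.
        if row.get("is_paid") is True:
            continue
        totals[row.get("ccy") or "UNKNOWN"] += row.get("balance") or 0.0
    return dict(totals)

ledger_rows = [
    {"invoice": "INV-BR-4471", "ccy": "USD", "balance": 34113.00, "is_paid": False},
    {"invoice": "INV-ND-2026-118", "ccy": "EUR", "balance": 0.0, "is_paid": True},
    {"invoice": "INV-X-0001", "ccy": "EUR", "balance": 250.0, "is_paid": False},  # made-up row
]
print(outstanding_by_currency(ledger_rows))
```

With the ledger already in a DataFrame, the same aggregation could be written as a pandas `groupby` on the `ccy` column; the plain-Python version is shown only to keep the example free of dependencies.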
In conclusion, we completed the tutorial by converting unstructured invoice PDFs into schema-valid JSON records, validating each extracted field against known ground truth, and assembling the results into a queryable accounts-payable ledger. This gives us more than a basic extraction demo: we evaluated how well the model handles numerical fields, nested vendor objects, arrays of line items, nullable attributes, boolean payment logic, and layout-level distractors that often break brittle parsers. We also reused a single loaded inference manager across the batch, which reflects how we would scale this workflow across many invoices without repeatedly reinitializing the model. By the end, we have a reproducible pipeline that generates test invoices, extracts structured financial data, scores the output, visualizes accuracy, and optionally extends to real invoice PDFs with the same schema-driven approach.
The post Designing a Schema-Guided Invoice Intelligence Pipeline with lift-pdf for Accounts-Payable Extraction, Validation, and Ledger Generation appeared first on MarkTechPost.
