
Using Lift to Turn Research PDFs into Structured JSON with Controlled, Schema-Guided Field-Level Evaluation


In this tutorial, we build a complete PDF-to-structured-data extraction workflow around Lift, with a focus on controlled evaluation rather than a simple demo run. We start by preparing a Colab-compatible GPU environment, selecting the appropriate precision mode for the available hardware, and patching model loading so that the Lift backend runs reliably even on constrained 16 GB GPUs via 4-bit NF4 quantization. From there, we generate synthetic multi-page research reports with deliberately placed distractors, including validation-versus-test metric ambiguity, baseline-versus-proposed-model comparisons, missing code-release cases, and boolean state-of-the-art claims. This provides a realistic testbed for schema-guided extraction, in which the model must recover titles, authors, datasets, metrics, hyperparameters, limitations, and repository links from document layouts rather than plain text.

Configuring Runtime and Dependencies

N_DOCS               = 3
FORCE_FULL_PRECISION = False
FORCE_4BIT           = False
SHOW_FIRST_PAGE      = True
RUN_ON_REAL_PDF      = False
REAL_PDF_URL         = "https://arxiv.org/pdf/1512.03385"
REAL_PDF_PAGES       = "0-3"
PIN_PILLOW           = True
PILLOW_VERSION       = "11.3.0"
import os, sys, subprocess, json, re, time, warnings
warnings.filterwarnings("ignore")
os.environ["TOKENIZERS_PARALLELISM"] = "false"
def pip(*pkgs, upgrade=False):
   """Install without invoking a shell (so '[hf]' isn't glob-expanded)."""
   args = [sys.executable, "-m", "pip", "install", "-q"] + (["-U"] if upgrade else []) + list(pkgs)
   print("  pip install", *pkgs)
   subprocess.run(args, check=False)
print("STEP 1/7 · Installing lift + light dependencies (first run is the slow one)…")
pip("reportlab", "pypdfium2", "pandas", "matplotlib")
pip("lift-pdf[hf]")
pip("bitsandbytes", "accelerate", upgrade=True)
if PIN_PILLOW:
   pip(f"pillow=={PILLOW_VERSION}")
   if "PIL" in sys.modules:
       import PIL
       if getattr(PIL, "__version__", "") != PILLOW_VERSION:
           print(f"     Pinned Pillow {PILLOW_VERSION} on disk, but a stale Pillow "
                 f"({getattr(PIL, '__version__', '?')}) is already loaded in memory.")
           print("     Restarting the runtime now; just re-run the cell(s) after it reconnects.")
           os.kill(os.getpid(), 9)
print("     …install finished.\n")
import torch

We configure the tutorial runtime by defining the main execution knobs for corpus size, precision mode, preview rendering, and optional real-PDF extraction. We also install the core dependencies required for PDF generation, rendering, plotting, and Lift's Hugging Face backend. The Pillow pinning logic is important because it prevents a known Colab compatibility issue in which newer Pillow builds can break downstream imports via torchvision and transformers.
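
The stale-module check in the install step can be isolated into two small helpers. The sketch below is illustrative only: `installed_version` and `needs_restart` are hypothetical names that do not appear in the tutorial or in Pillow, and the code relies solely on the standard-library `importlib.metadata`.

```python
import sys
from importlib import metadata


def installed_version(package):
    """Return the on-disk version of a distribution, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def needs_restart(module_name, pinned_version):
    """True when a module is already imported with a version other than the pin."""
    module = sys.modules.get(module_name)
    if module is None:
        return False  # nothing loaded yet, so a fresh import will pick up the pin
    return getattr(module, "__version__", "") != pinned_version


if __name__ == "__main__":
    # Hypothetical usage mirroring the tutorial's PILLOW_VERSION knob.
    print("Pillow on disk:", installed_version("pillow"))
    print("Restart needed:", needs_restart("PIL", "11.3.0"))
```

Separating "what is on disk" from "what is loaded in memory" makes it clear why a restart is needed: pip can change the former without touching the latter.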

Loading Lift 4-bit Backend

def detect_gpu():
   if not torch.cuda.is_available():
       raise SystemExit(
           "\n✗ No CUDA GPU found. In Colab: Runtime ▸ Change runtime type ▸ GPU "
           "(A100 is best; L4/T4 also work).\n"
       )
   p  = torch.cuda.get_device_properties(0)
   cc = torch.cuda.get_device_capability(0)
   return p.name, p.total_memory / 1e9, cc
def enable_4bit(compute_dtype):
   """
   Load lift's weights in 4-bit NF4 no matter which transformers Auto* class it uses
   internally. We inject a quantization_config + on-GPU device_map, and neutralize any
   later model.to()/.cuda() (which is illegal on a bnb-quantized model). This is what lets
   a ~10 B model fit on a 16 GB T4 / 24 GB L4.
   """
   import inspect, functools, transformers
   from transformers import BitsAndBytesConfig
   bnb = BitsAndBytesConfig(
       load_in_4bit=True,
       bnb_4bit_quant_type="nf4",
       bnb_4bit_use_double_quant=True,
       bnb_4bit_compute_dtype=compute_dtype,
   )
   def patch(cls):
       try:
           cm   = inspect.getattr_static(cls, "from_pretrained")
           orig = cm.__func__ if isinstance(cm, (classmethod, staticmethod)) else cm
       except Exception:
           return
       @functools.wraps(orig)
       def inner(cls_, *args, **kwargs):
           # Caller-supplied values win; we only fill in what is missing.
           kwargs.setdefault("quantization_config", bnb)
           kwargs.setdefault("device_map", {"": 0})
           model = orig(cls_, *args, **kwargs)
           try:
               # Turn later device moves into no-ops on the quantized model.
               model.to   = lambda *a, **k: model
               model.cuda = lambda *a, **k: model
           except Exception:
               pass
           return model
       cls.from_pretrained = classmethod(inner)
   for name in ["AutoModelForImageTextToText", "AutoModelForMultimodalLM",
                "AutoModelForVision2Seq", "AutoModelForCausalLM", "AutoModel"]:
       c = getattr(transformers, name, None)
       if c is not None:
           patch(c)
   try:
       from transformers.modeling_utils import PreTrainedModel
       patch(PreTrainedModel)
   except Exception:
       pass
print("STEP 2/7 · Preparing the model backend…")
gpu_name, vram, cc = detect_gpu()
use_4bit      = FORCE_4BIT or (vram < 34 and not FORCE_FULL_PRECISION)
compute_dtype = torch.bfloat16 if cc[0] >= 8 else torch.float16
print(f"     GPU: {gpu_name} | ~{vram:.0f} GB | compute capability {cc[0]}.{cc[1]}")
print(f"     Load mode: {'4-bit NF4' if use_4bit else 'full bf16'} (compute dtype {compute_dtype})")
os.environ.setdefault("TORCH_DEVICE", "cuda:0")
os.environ.setdefault("MODEL_CHECKPOINT", "datalab-to/lift")
if use_4bit:
   enable_4bit(compute_dtype)
from lift import extract
from lift.model import InferenceManager
print("     Loading lift weights (≈20 GB download on first run)…")
_t = time.time()
MODEL = InferenceManager(method="hf")
print(f"     ✓ model ready in {time.time() - _t:.0f}s\n")
def run_lift(pdf_path, schema, page_range=None):
   kw = {"model": MODEL}
   if page_range:
       kw["page_range"] = page_range
   result = extract(pdf_path, schema, **kw)
   return getattr(result, "extraction", None)

We prepare the Lift inference backend by detecting the available CUDA GPU, reading its total VRAM, and choosing between full-precision and 4-bit NF4 loading. The 4-bit patch injects a BitsAndBytes quantization configuration into compatible Transformers model loaders, allowing the model to fit on smaller GPUs such as T4 or L4. We then initialize a reusable InferenceManager that avoids reloading the model for each document and makes the extraction pipeline practical for batch processing.
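
The loader patch is easier to follow on a toy class that needs neither a GPU nor Transformers. The sketch below is a simplified stand-in: `ToyLoader` and `inject_defaults` are hypothetical names, and the injected values are plain strings and dicts rather than a real `BitsAndBytesConfig`. It shows only the wrapping pattern, in which defaults are added with `setdefault` so that explicit caller arguments still win.

```python
import functools
import inspect


class ToyLoader:
    """Hypothetical stand-in for a class that exposes from_pretrained."""

    @classmethod
    def from_pretrained(cls, name, **kwargs):
        return {"cls": cls.__name__, "name": name, **kwargs}


def inject_defaults(cls, **defaults):
    """Wrap cls.from_pretrained so missing keyword arguments get default values."""
    raw = inspect.getattr_static(cls, "from_pretrained")
    # getattr_static returns the classmethod object itself; unwrap the function.
    orig = raw.__func__ if isinstance(raw, (classmethod, staticmethod)) else raw

    @functools.wraps(orig)
    def inner(cls_, *args, **kwargs):
        for key, value in defaults.items():
            kwargs.setdefault(key, value)  # never override what the caller passed
        return orig(cls_, *args, **kwargs)

    cls.from_pretrained = classmethod(inner)


inject_defaults(ToyLoader, quantization_config="nf4-placeholder", device_map={"": 0})

patched = ToyLoader.from_pretrained("demo-checkpoint")
explicit = ToyLoader.from_pretrained("demo-checkpoint", device_map="cpu")
print(patched)
print(explicit)
```

Because the wrapper uses `setdefault`, a library that already passes its own `device_map` or `quantization_config` is left alone, which is presumably why the tutorial's patch can be applied broadly across several loader classes.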

Building the Synthetic Corpus

DOCS = [
   dict(
       title="SolarNet: Efficient Land-Cover Classification from Multispectral Satellite Imagery",
       authors=[("Maya Okafor", "TU Delft"),
                ("Liang Wei", "TU Delft"),
                ("Priya Ramachandran", "European Space Research Institute")],
       task="satellite image land-cover classification",
       method="SolarNet",
       datasets=["EuroSAT", "BigEarthNet", "So2Sat"],
       primary_benchmark="EuroSAT",
       metric_name="Top-1 accuracy",
       test_acc=96.4, val_acc=97.1, baseline_name="ResNet-50",
       baseline_val=92.0, baseline_test=91.2,
       params_m=42.7, optimizer="AdamW", lr=0.0003, batch=128, epochs=90,
       beats_sota=True, prior_best=95.1,
       code_url=None,
       funding_note="This work was supported by the Open Earth Initiative. "
                    "The authors do not release source code for the trained models.",
       limitations=["Accuracy degrades on scenes with heavy cloud cover.",
                    "Trained only on imagery at 10 m spatial resolution."],
   ),
   dict(
       title="GraphMoE: Mixture-of-Experts Message Passing for Molecular Property Prediction",
       authors=[("Sofia Álvarez", "ETH Zürich"),
                ("Daniel Kim", "ETH Zürich"),
                ("Yara Haddad", "Genentech"),
                ("Tom Becker", "ETH Zürich")],
       task="molecular property prediction",
       method="GraphMoE",
       datasets=["OGB-MolHIV", "QM9", "ZINC"],
       primary_benchmark="OGB-MolHIV",
       metric_name="ROC-AUC",
       test_acc=0.812, val_acc=0.828, baseline_name="GIN",
       baseline_val=0.784, baseline_test=0.771,
       params_m=8.3, optimizer="Adam", lr=0.001, batch=256, epochs=120,
       beats_sota=True, prior_best=0.799,
       code_url="https://github.com/mol-ai/graphmoe",
       funding_note="Funded by the Swiss NSF. Code and pretrained checkpoints are available "
                    "at https://github.com/mol-ai/graphmoe.",
       limitations=["Expert routing adds ~15% inference latency versus a dense GNN.",
                    "Evaluated only on small-molecule datasets under 50 heavy atoms."],
   ),
   dict(
       title="AcoustiFormer: A Compact Transformer for Environmental Sound Classification",
       authors=[("Noah Fischer", "University of Edinburgh"),
                ("Aisha Bello", "University of Edinburgh"),
                ("Kenji Watanabe", "Sony CSL")],
       task="environmental sound classification",
       method="AcoustiFormer",
       datasets=["ESC-50", "UrbanSound8K"],
       primary_benchmark="ESC-50",
       metric_name="accuracy",
       test_acc=88.7, val_acc=90.3, baseline_name="CNN14",
       baseline_val=90.8, baseline_test=89.2,
       params_m=22.1, optimizer="AdamW", lr=0.0005, batch=64, epochs=200,
       beats_sota=False, prior_best=89.2,
       code_url="https://github.com/audio-lab/acoustiformer",
       funding_note="Code available at https://github.com/audio-lab/acoustiformer.",
       limitations=["A larger CNN baseline still outperforms our model on ESC-50.",
                    "Performance was not evaluated on real-time streaming audio."],
   ),
][:N_DOCS]
def ground_truth(d):
   """Reshape a source dict into the exact JSON shape our schema asks for."""
   return {
       "title": d["title"],
       "authors": [{"name": n, "affiliation": a} for (n, a) in d["authors"]],
       "primary_task": d["task"],
       "proposed_method_name": d["method"],
       "datasets": d["datasets"],
       "headline_metric": {"name": d["metric_name"],
                           "value": d["test_acc"],
                           "benchmark": d["primary_benchmark"]},
       "num_parameters_millions": d["params_m"],
       "hyperparameters": {"optimizer": d["optimizer"], "learning_rate": d["lr"],
                           "batch_size": d["batch"], "epochs": d["epochs"]},
       "beats_prior_sota": d["beats_sota"],
       "code_url": d["code_url"],
       "limitations": d["limitations"],
   }

We define a small but carefully controlled synthetic corpus of machine-learning research reports with structured metadata. Each document includes realistic fields such as authors, datasets, benchmark metrics, hyperparameters, model size, code availability, limitations, and SOTA claims. The ground_truth function reshapes the same source metadata into the exact JSON structure expected by the extraction schema, providing a precise reference for evaluation.
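
To make the notion of a distractor concrete, the sketch below lists the numbers in a report that sit close to the headline test metric but are not the right answer. The helper name `near_miss_values` is hypothetical; the sample figures are copied from the AcoustiFormer entry in the corpus.

```python
def near_miss_values(doc):
    """Numbers that appear in the report but are NOT the headline test metric."""
    candidates = {
        "validation score of the proposed method": doc["val_acc"],
        "baseline validation score": doc["baseline_val"],
        "baseline test score": doc["baseline_test"],
        "prior best cited in the abstract": doc["prior_best"],
    }
    # Drop any candidate that happens to equal the true answer.
    return {label: value for label, value in candidates.items()
            if value != doc["test_acc"]}


# Figures taken from the AcoustiFormer document definition.
sample = dict(test_acc=88.7, val_acc=90.3, baseline_val=90.8,
              baseline_test=89.2, prior_best=89.2)

for label, value in near_miss_values(sample).items():
    print(f"{label}: {value}")
```

In this document every distractor is higher than the true test score, so an extractor that simply picks the largest number in the results table would be wrong.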

Rendering Multi-Page PDF Reports

def render_pdf(d, path):
   """Draw a realistic 3-page report. Page breaks are forced so the headline metric on
   page 1 (abstract) is physically separated from the results table on page 3."""
   from reportlab.lib.pagesizes import LETTER
   from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
   from reportlab.lib.units import inch
   from reportlab.lib import colors
   from reportlab.platypus import (SimpleDocTemplate, Paragraph, Spacer,
                                   Table, TableStyle, PageBreak)
   ss = getSampleStyleSheet()
   H1   = ParagraphStyle("H1", parent=ss["Title"], fontSize=16, leading=20, spaceAfter=6)
   AUTH = ParagraphStyle("AUTH", parent=ss["Normal"], fontSize=9.5, textColor=colors.grey, spaceAfter=10)
   H2   = ParagraphStyle("H2", parent=ss["Heading2"], fontSize=12, spaceBefore=8, spaceAfter=4)
   BODY = ParagraphStyle("BODY", parent=ss["Normal"], fontSize=10, leading=14, spaceAfter=6)
   sota_phrase = (f"surpassing the previous best of {d['prior_best']}"
                  if d["beats_sota"] else
                  f"approaching but not exceeding the previous best of {d['prior_best']}")
   authors_line = ", ".join(f"{n} ({a})" for (n, a) in d["authors"])
   story = []
   story += [Paragraph(d["title"], H1), Paragraph(authors_line, AUTH), Paragraph("Abstract", H2)]
   story += [Paragraph(
       f"We introduce {d['method']}, a model for {d['task']}. On the {d['primary_benchmark']} "
       f"benchmark, {d['method']} attains {d['test_acc']} {d['metric_name']} on the held-out "
       f"test set, {sota_phrase}. Our {d['params_m']}M-parameter model is evaluated across "
       f"{len(d['datasets'])} datasets ({', '.join(d['datasets'])}). "
       f"Extensive ablations confirm the contribution of each component.", BODY)]
   story += [Paragraph("Keywords", H2),
             Paragraph(f"{d['task']}; representation learning; {d['primary_benchmark']}", BODY),
             PageBreak()]
   story += [Paragraph("1  Method and Training Details", H2)]
   story += [Paragraph(
       f"{d['method']} is trained end-to-end with the {d['optimizer']} optimizer. "
       f"We tune on a validation split and report final numbers on the test split. "
       f"The full training configuration is summarized in Table 1.", BODY)]
   hp = [["Hyperparameter", "Value"],
         ["Optimizer", d["optimizer"]],
         ["Learning rate", str(d["lr"])],
         ["Batch size", str(d["batch"])],
         ["Epochs", str(d["epochs"])],
         ["Parameters", f"{d['params_m']}M"]]
   t1 = Table(hp, colWidths=[2.4 * inch, 2.0 * inch])
   t1.setStyle(TableStyle([
       ("BACKGROUND", (0, 0), (-1, 0), colors.HexColor("#2b3a67")),
       ("TEXTCOLOR", (0, 0), (-1, 0), colors.white),
       ("FONTSIZE", (0, 0), (-1, -1), 9.5),
       ("GRID", (0, 0), (-1, -1), 0.4, colors.grey),
       ("ROWBACKGROUNDS", (0, 1), (-1, -1), [colors.white, colors.HexColor("#eef1f8")]),
       ("LEFTPADDING", (0, 0), (-1, -1), 8), ("TOPPADDING", (0, 0), (-1, -1), 4),
       ("BOTTOMPADDING", (0, 0), (-1, -1), 4)]))
   story += [Spacer(1, 4), t1, Spacer(1, 6),
             Paragraph("<b>Table 1.</b> Training configuration.", BODY),
             Paragraph("2  Datasets", H2),
             Paragraph(
                 f"We evaluate on {', '.join(d['datasets'])}. {d['primary_benchmark']} is our "
                 f"primary benchmark; the remaining datasets are used for generalization "
                 f"studies.", BODY),
             PageBreak()]
   story += [Paragraph("3  Results", H2)]
   res = [["Method", f"Val. {d['metric_name']}", f"Test {d['metric_name']}"],
          [f"{d['baseline_name']} (baseline)", str(d["baseline_val"]), str(d["baseline_test"])],
          [f"{d['method']} (ours)", str(d["val_acc"]), str(d["test_acc"])]]
   t2 = Table(res, colWidths=[2.6 * inch, 1.7 * inch, 1.7 * inch])
   t2.setStyle(TableStyle([
       ("BACKGROUND", (0, 0), (-1, 0), colors.HexColor("#7a2e2e")),
       ("TEXTCOLOR", (0, 0), (-1, 0), colors.white),
       ("FONTSIZE", (0, 0), (-1, -1), 9.5),
       ("GRID", (0, 0), (-1, -1), 0.4, colors.grey),
       ("FONTNAME", (0, 2), (-1, 2), "Helvetica-Bold"),
       ("ROWBACKGROUNDS", (0, 1), (-1, -1), [colors.white, colors.HexColor("#f7eeee")]),
       ("LEFTPADDING", (0, 0), (-1, -1), 8), ("TOPPADDING", (0, 0), (-1, -1), 4),
       ("BOTTOMPADDING", (0, 0), (-1, -1), 4)]))
   story += [Spacer(1, 4), t2, Spacer(1, 6),
             Paragraph(f"<b>Table 2.</b> Results on {d['primary_benchmark']}. "
                       f"Best test result in bold.", BODY),
             Paragraph("4  Limitations", H2)]
   for lim in d["limitations"]:
       story += [Paragraph("• " + lim, BODY)]
   story += [Paragraph("5  Funding and Code Availability", H2),
             Paragraph(d["funding_note"], BODY)]
   SimpleDocTemplate(path, pagesize=LETTER,
                     topMargin=0.8 * inch, bottomMargin=0.8 * inch,
                     leftMargin=0.9 * inch, rightMargin=0.9 * inch).build(story)
print("STEP 3/7 · Generating synthetic report PDFs…")
CORPUS = []
for i, d in enumerate(DOCS):
   path = f"/content/report_{i}.pdf" if os.path.isdir("/content") else f"report_{i}.pdf"
   render_pdf(d, path)
   CORPUS.append((d, ground_truth(d), path))
   print(f"     ✓ {os.path.basename(path)}  —  {d['method']}")
print()
if SHOW_FIRST_PAGE:
   try:
       import pypdfium2 as pdfium, matplotlib.pyplot as plt
       pg  = pdfium.PdfDocument(CORPUS[0][2])[0]
       img = pg.render(scale=2.0).to_pil()
       plt.figure(figsize=(6.4, 8.3)); plt.imshow(img); plt.axis("off")
       plt.title("What lift reads: page 1 of report_0.pdf", fontsize=10); plt.show()
   except Exception as e:
       print("     (page preview skipped:", e, ")\n")

We render each synthetic research report as a multi-page PDF using ReportLab, including formatted sections, tables, page breaks, authorship metadata, results, and limitations. The layout deliberately spreads important evidence across pages so the extraction task behaves more like real document mining than simple text parsing. The optional page preview uses pypdfium2 and Matplotlib to verify what the model visually receives before extraction begins.

Defining the Extraction Schema

SCHEMA = {
   "type": "object",
   "properties": {
       "title": {"type": "string", "description": "The full title of the paper"},
       "authors": {
           "type": "array",
           "description": "Every author listed, in order",
           "items": {"type": "object", "properties": {
               "name":        {"type": "string"},
               "affiliation": {"type": "string", "description": "The author's institution"},
           }},
       },
       "primary_task": {"type": "string",
                        "description": "The main machine-learning task the paper addresses"},
       "proposed_method_name": {"type": "string",
                                "description": "Name of the model/method the paper introduces "
                                               "(not a baseline it compares against)"},
       "datasets": {"type": "array", "items": {"type": "string"},
                    "description": "All benchmark datasets the paper evaluates on"},
       "headline_metric": {
           "type": "object",
           "description": "The main reported result for the proposed method",
           "properties": {
               "name":      {"type": "string", "description": "Metric name, e.g. Top-1 accuracy or ROC-AUC"},
               "value":     {"type": "number", "description": "The proposed method's value for this metric on "
                                                             "the PRIMARY benchmark's TEST set, not the "
                                                             "validation number and not a baseline's number"},
               "benchmark": {"type": "string", "description": "The dataset the headline metric is reported on"},
           }},
       "num_parameters_millions": {"type": "number",
                                   "description": "Total parameter count of the proposed model, in millions"},
       "hyperparameters": {
           "type": "object",
           "properties": {
               "optimizer":     {"type": "string"},
               "learning_rate": {"type": "number"},
               "batch_size":    {"type": "integer"},
               "epochs":        {"type": "integer"},
           }},
       "beats_prior_sota": {"type": "boolean",
                            "description": "true only if the paper claims its proposed method beats the "
                                           "previous state of the art on the primary benchmark; otherwise false"},
       "code_url": {"type": "string",
                    "description": "URL of the released source-code repository. Return null if the paper "
                                   "does not release code"},
       "limitations": {"type": "array", "items": {"type": "string"},
                       "description": "Limitations the authors explicitly acknowledge"},
   },
   "required": ["title", "proposed_method_name", "headline_metric"],
}

We define a JSON Schema that tells Lift exactly which fields to extract and how each field should be interpreted. The schema descriptions are technically important because they disambiguate proposed-method values from baseline values, test metrics from validation metrics, and released-code URLs from explicit no-code cases. This turns the extraction task into a controlled, schema-guided information retrieval task rather than an open-ended summarization task.
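
Before scoring, it can help to confirm that an extraction has the shape the schema describes. The sketch below is a deliberately small checker written for illustration: `shape_problems` is a hypothetical helper, it covers only the handful of JSON Schema keywords used in this tutorial, and it is not a substitute for a full validator. It also assumes that null is acceptable for any field, which matches the way the tutorial asks for a null `code_url` even though that field is typed as a string. Standard JSON Schema would express this as `"type": ["string", "null"]`; whether Lift accepts that union form is not stated here.

```python
def shape_problems(value, schema, path="$"):
    """Return a list of human-readable mismatches between value and schema."""
    if value is None:
        return []  # assumption: null is accepted everywhere (abstention)
    kind = schema.get("type")
    if kind == "object":
        if not isinstance(value, dict):
            return [f"{path}: expected object"]
        problems = [f"{path}.{key}: required field missing"
                    for key in schema.get("required", []) if key not in value]
        for key, sub in schema.get("properties", {}).items():
            if key in value:
                problems += shape_problems(value[key], sub, f"{path}.{key}")
        return problems
    if kind == "array":
        if not isinstance(value, list):
            return [f"{path}: expected array"]
        problems = []
        for i, item in enumerate(value):
            problems += shape_problems(item, schema.get("items", {}), f"{path}[{i}]")
        return problems
    checks = {
        "string": lambda v: isinstance(v, str),
        "boolean": lambda v: isinstance(v, bool),
        # bool is a subclass of int in Python, so exclude it explicitly.
        "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
        "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    }
    check = checks.get(kind)
    if check is not None and not check(value):
        return [f"{path}: expected {kind}"]
    return []


toy_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "epochs": {"type": "integer"},
        "datasets": {"type": "array", "items": {"type": "string"}},
        "code_url": {"type": "string"},
    },
    "required": ["title"],
}

good = {"title": "SolarNet", "epochs": 90, "datasets": ["EuroSAT"], "code_url": None}
bad = {"epochs": "90", "datasets": ["EuroSAT", 3]}

print(shape_problems(good, toy_schema))
print(shape_problems(bad, toy_schema))
```

A shape check of this kind separates two different failures: output that is structurally wrong, and output that is well-formed but contains the wrong value. Only the second is what field-level scoring measures.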

Scoring Against Ground Truth

def _norm(s):
   # Lowercase, collapse whitespace, and trim trailing punctuation before comparing strings.
   return re.sub(r"\s+", " ", str(s).strip().lower()).strip(" .,:;/")
def _num(x):
   try:    return float(str(x).replace("%", "").replace(",", "").strip())
   except Exception: return None
def leaf_equal(gt, pr):
   if gt is None and pr is None:                       return True
   if gt is None or pr is None:                        return False
   if isinstance(gt, bool) or isinstance(pr, bool):    return bool(gt) == bool(pr)
   a, b = _num(gt), _num(pr)
   if a is not None and b is not None:
       # Absolute tolerance when the prediction is zero, 0.5% relative tolerance otherwise.
       return abs(a - b) < 1e-6 if b == 0 else abs(a - b) / max(abs(a), abs(b)) < 5e-3
   return _norm(gt) == _norm(pr)
def flatten(o, prefix=""):
   out = {}
   if isinstance(o, dict):
       for k, v in o.items():
           out.update(flatten(v, f"{prefix}.{k}" if prefix else k))
   elif isinstance(o, list):
       for i, v in enumerate(o):
           out.update(flatten(v, f"{prefix}[{i}]"))
   else:
       out[prefix] = o
   return out
def score(gt, pred):
   fg, fp = flatten(gt), flatten(pred or {})
   rows, correct = [], 0
   for key, gv in fg.items():
       present = key in fp
       pv = fp.get(key)
       ok = (gv is None and (not present or pv is None)) or (present and leaf_equal(gv, pv))
       correct += int(ok)
       rows.append((key, gv, (pv if present else None), ok))
   return (correct / len(fg) if fg else 0.0), rows
print("STEP 4/7 · Extracting with lift and scoring against ground truth…\n")
results = []
for i, (src, gt, path) in enumerate(CORPUS):
   t0 = time.time()
   pred = run_lift(path, SCHEMA)
   dt = time.time() - t0
   acc, rows = score(gt, pred)
   results.append(dict(src=src, gt=gt, pred=pred, acc=acc, rows=rows, seconds=dt))
   print(f"     doc {i} · {src['method']:<14} field accuracy {acc*100:5.1f}%   ({dt:.1f}s)")
r0 = results[0]
print("\n" + "=" * 90)
print(f"DETAILED VIEW · doc 0 · {r0['src']['method']}")
print("=" * 90)
print("Raw JSON lift returned (guaranteed to match the schema shape):\n")
print(json.dumps(r0["pred"], indent=2, ensure_ascii=False))
import pandas as pd
pd.set_option("display.max_colwidth", 46)
pd.set_option("display.width", 120)
grade = pd.DataFrame([{"field": k,
                      "ground_truth": ("∅ (null)" if g is None else g),
                      "lift_predicted": ("∅ (null)" if p is None else p),
                      "✓": "✓" if ok else "✗"}
                     for (k, g, p, ok) in r0["rows"]])
print("\nField-by-field grade:\n")
print(grade.to_string(index=False))
print("\nWhat to look for:")
print("  • headline_metric.value should be the TEST number, not the higher validation number,")
print("    and not the baseline row; this is the near-miss-distractor test.")
print("  • code_url should be ∅ (null): report 0 explicitly releases no code (abstention).")
print("  • every author + affiliation and every dataset should be present (exhaustive lists).")

We implement a field-level scoring system that flattens nested JSON outputs and compares predictions against ground truth with type-aware logic. Numeric values are compared with a 0.5% relative tolerance, strings are normalized before comparison, booleans are handled explicitly, and a missing or null prediction counts as correct only when the ground truth is also null, which covers the abstention cases. The snippet then runs Lift on every generated PDF, records accuracy and latency, and prints a detailed diagnostic view for the first document.
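
A short worked example shows what flattening and the numeric tolerance do in practice. The sketch below re-implements both pieces in isolation; `close_enough` is a hypothetical simplification of the tutorial's numeric branch, and the sample values are taken from the SolarNet and GraphMoE corpus entries.

```python
def flatten(obj, prefix=""):
    """Turn nested dicts and lists into a flat {path: leaf} mapping."""
    out = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            out.update(flatten(value, f"{prefix}.{key}" if prefix else key))
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            out.update(flatten(value, f"{prefix}[{i}]"))
    else:
        out[prefix] = obj
    return out


def close_enough(a, b, rel_tol=5e-3):
    """Relative comparison mirroring the 0.5% tolerance used in the scorer."""
    if a == b:
        return True
    return abs(a - b) / max(abs(a), abs(b)) < rel_tol


flat = flatten({
    "headline_metric": {"name": "Top-1 accuracy", "value": 96.4},
    "datasets": ["EuroSAT", "So2Sat"],
})
for path, leaf in flat.items():
    print(path, "=", leaf)

# A rounding-level difference passes; test-versus-validation confusion does not.
print(close_enough(96.4, 96.5))     # gap is about 0.1%
print(close_enough(96.4, 97.1))     # SolarNet test vs. validation, about 0.7%
print(close_enough(0.812, 0.828))   # GraphMoE test vs. validation, about 1.9%
```

Two consequences follow from this design. List items are keyed by position, so a prediction that lists the right datasets in a different order is scored as mismatches. The tolerance is also tight enough that each document's validation number is rejected when the test number is expected.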

Assembling the Knowledge Base

print("\n" + "=" * 90)
print("STEP 5/7 · Assembling the extractions into a queryable research knowledge base")
print("=" * 90)
def g(d, path, default=None):
   # Safe nested lookup: "a.b" walks d["a"]["b"] and falls back to default on any gap.
   cur = d
   for key in path.split("."):
       if isinstance(cur, dict) and cur.get(key) is not None:
           cur = cur[key]
       else:
           return default
   return cur
kb = pd.DataFrame([{
   "method":     g(r["pred"], "proposed_method_name"),
   "task":       g(r["pred"], "primary_task"),
   "benchmark":  g(r["pred"], "headline_metric.benchmark"),
   "metric":     g(r["pred"], "headline_metric.name"),
   "score":      g(r["pred"], "headline_metric.value"),
   "params_M":   g(r["pred"], "num_parameters_millions"),
   "beats_sota": g(r["pred"], "beats_prior_sota"),
   "authors":    len(g(r["pred"], "authors", []) or []),
   "code":       g(r["pred"], "code_url"),
   "field_acc":  round(r["acc"], 3),
} for r in results])
print("\nResearch knowledge base (one row per mined paper):\n")
print(kb.to_string(index=False))
print("\nExample query: papers that claim to beat SOTA, best result first:\n")
hits = kb[kb["beats_sota"] == True].sort_values("score", ascending=False)
print(hits.to_string(index=False) if len(hits) else "  (none in this sample)")
overall = sum(r["acc"] for r in results) / len(results)
print(f"\nSTEP 6/7 · Overall field accuracy across {len(results)} documents: {overall*100:.1f}%")
print("     (Datalab report lift at ~90.2% field accuracy on their 225-doc benchmark.)")
try:
   import matplotlib.pyplot as plt
   labels = [r["src"]["method"] for r in results]
   accs   = [r["acc"] * 100 for r in results]
   plt.figure(figsize=(7, 3.6))
   bars = plt.bar(labels, accs, color="#2b3a67")
   plt.axhline(90.2, ls="--", color="#7a2e2e", lw=1.4, label="lift benchmark (90.2%)")
   for b, a in zip(bars, accs):
       plt.text(b.get_x() + b.get_width()/2, a + 1, f"{a:.0f}%", ha="center", fontsize=9)
   plt.ylim(0, 108); plt.ylabel("Field accuracy (%)")
   plt.title("Per-document extraction accuracy on the synthetic corpus")
   plt.legend(fontsize=8); plt.tight_layout(); plt.show()
except Exception as e:
   print("     (chart skipped:", e, ")")
if RUN_ON_REAL_PDF:
   print("\n" + "=" * 90)
   print(f"STEP 7/7 · Bonus: extracting from a REAL paper: {REAL_PDF_URL}")
   print("=" * 90)
   try:
       import urllib.request
       real_path = "/content/real_paper.pdf" if os.path.isdir("/content") else "real_paper.pdf"
       urllib.request.urlretrieve(REAL_PDF_URL, real_path)
       pred_real = run_lift(real_path, SCHEMA, page_range=REAL_PDF_PAGES)
       print("\nExtraction (no ground truth; real papers are genuinely harder):\n")
       print(json.dumps(pred_real, indent=2, ensure_ascii=False))
       print("\nReal papers vary wildly in layout; tighten field `description`s and use "
             "page_range to point lift at the sections that carry the answer.")
   except Exception as e:
       print("     real-PDF pass failed:", e)
else:
   print("\nSTEP 7/7 · (skipped) set RUN_ON_REAL_PDF = True to also mine a real arXiv PDF.")
print("\n✅ Done. You now have: schema-valid extractions, a scored grade, and a knowledge base.")
print("   Next: swap in your own PDFs + schema, or reuse MODEL across thousands of files.")

We convert the extracted data into a compact research knowledge base where each row represents one mined paper and key fields become queryable columns. The tutorial then demonstrates a simple analytical query for papers claiming to beat prior SOTA and computes overall extraction accuracy across the corpus. Finally, it visualizes per-document accuracy and optionally applies the same schema to a real arXiv PDF, extending the workflow from controlled benchmarking to practical document intelligence.
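
The same lookup-and-query idea works without pandas, which can be handy in a lightweight script. In the sketch below, `dig` is a hypothetical name for the nested getter, and the three records are hand-written stand-ins for extractions rather than real model output.

```python
def dig(record, dotted_path, default=None):
    """Follow a dotted path through nested dicts, returning default on any gap."""
    cur = record
    for key in dotted_path.split("."):
        if isinstance(cur, dict) and cur.get(key) is not None:
            cur = cur[key]
        else:
            return default
    return cur


# Hand-written stand-ins; the third record simulates a missing nested object.
extractions = [
    {"proposed_method_name": "SolarNet", "beats_prior_sota": True,
     "headline_metric": {"value": 96.4}},
    {"proposed_method_name": "AcoustiFormer", "beats_prior_sota": False,
     "headline_metric": {"value": 88.7}},
    {"proposed_method_name": "GraphMoE", "beats_prior_sota": True,
     "headline_metric": None},
]

rows = [{"method": dig(e, "proposed_method_name"),
         "score": dig(e, "headline_metric.value"),
         "beats_sota": dig(e, "beats_prior_sota", False)} for e in extractions]

# Keep only SOTA claims that also have a usable score, best first.
winners = sorted((r for r in rows if r["beats_sota"] and r["score"] is not None),
                 key=lambda r: r["score"], reverse=True)

for row in winners:
    print(row["method"], row["score"])
```

Returning a default instead of raising matters here because a partially failed extraction should produce a row with gaps, not abort the whole knowledge-base build.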

Conclusion

In conclusion, we have more than raw model outputs: we have a repeatable extraction benchmark with ground-truth labels, field-level scoring, error inspection, and a compact research knowledge base assembled from the mined PDFs. The tutorial showed how structured JSON schemas can turn unstructured technical documents into queryable data while also exposing the failure modes that matter in real document intelligence systems, such as metric confusion, incomplete lists, unsupported null fields, and incorrect SOTA interpretation. The optional real-PDF stage extended the same pipeline beyond synthetic reports, making the notebook a practical template for evaluating Lift on custom research corpora, regulatory filings, technical reports, or any long-form documents where precise structured extraction is more important than generic summarization.



The post Using Lift to Turn Research PDFs into Structured JSON with Controlled, Schema-Guided Field-Level Evaluation appeared first on MarkTechPost.
