
A Coding Guide to Build Advanced Document Intelligence Pipelines with Google LangExtract, OpenAI Models, Structured Extraction, and Interactive Visualization

In this tutorial, we explore how to use Google's LangExtract library to transform unstructured text into structured, machine-readable information. We begin by installing the required dependencies and securely configuring our OpenAI API key to leverage powerful language models for extraction tasks. We then build a reusable extraction pipeline that lets us process a range of document types, including contracts, meeting notes, product announcements, and operational logs. Through carefully designed prompts and example annotations, we demonstrate how LangExtract can identify entities, actions, deadlines, risks, and other structured attributes while grounding them to their exact source spans. We also visualize the extracted information and organize it into tabular datasets, enabling downstream analytics, automation workflows, and decision-making systems.

!pip -q install -U "langextract[openai]" pandas IPython


import os
import json
import textwrap
import getpass
import pandas as pd


OPENAI_API_KEY = getpass.getpass("Enter OPENAI_API_KEY: ")
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY


import langextract as lx
from IPython.display import display, HTML

We install the required libraries, including LangExtract, Pandas, and IPython, so that our Colab environment is ready for structured extraction tasks. We securely request the OpenAI API key from the user and store it as an environment variable for safe access during runtime. We then import the core libraries needed to run LangExtract, display results, and handle structured outputs.

MODEL_ID = "gpt-4o-mini"


def run_extraction(
   text_or_documents,
   prompt_description,
   examples,
   output_stem,
   model_id=MODEL_ID,
   extraction_passes=1,
   max_workers=4,
   max_char_buffer=1800,
):
   result = lx.extract(
       text_or_documents=text_or_documents,
       prompt_description=prompt_description,
       examples=examples,
       model_id=model_id,
       api_key=os.environ["OPENAI_API_KEY"],
       fence_output=True,
       use_schema_constraints=False,
       extraction_passes=extraction_passes,
       max_workers=max_workers,
       max_char_buffer=max_char_buffer,
   )


   jsonl_name = f"{output_stem}.jsonl"
   html_name = f"{output_stem}.html"


   lx.io.save_annotated_documents([result], output_name=jsonl_name, output_dir=".")
   html_content = lx.visualize(jsonl_name)


   with open(html_name, "w", encoding="utf-8") as f:
       if hasattr(html_content, "data"):
           f.write(html_content.data)
       else:
           f.write(html_content)


   return result, jsonl_name, html_name


def extraction_rows(result):
   rows = []
   for ex in result.extractions:
       start_pos = None
       end_pos = None
       if getattr(ex, "char_interval", None):
           start_pos = ex.char_interval.start_pos
           end_pos = ex.char_interval.end_pos


       rows.append({
           "class": ex.extraction_class,
           "text": ex.extraction_text,
           "attributes": json.dumps(ex.attributes or {}, ensure_ascii=False),
           "start": start_pos,
           "end": end_pos,
       })
   return pd.DataFrame(rows)


def preview_result(title, result, html_name, max_rows=50):
   print("=" * 80)
   print(title)
   print("=" * 80)
   print(f"Total extractions: {len(result.extractions)}")
   df = extraction_rows(result)
   display(df.head(max_rows))
   display(HTML(f'<p><a href="{html_name}" target="_blank">Open interactive visualization: {html_name}</a></p>'))

We define the core utility functions that power the entire extraction pipeline. We create a reusable run_extraction function that sends text to the LangExtract engine and generates both JSONL and HTML outputs. We also define helper functions to convert the extraction results into tabular rows and preview them interactively in the notebook.
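Before wiring the helpers into a live pipeline, it helps to see what the tabular output looks like and how to work with it. Below is a minimal sketch, using hand-built sample rows rather than a live API call, of post-processing the DataFrame that extraction_rows produces: the "attributes" column holds JSON strings, so we decode it back into dicts before filtering.

```python
import json
import pandas as pd

# Hypothetical rows mirroring the columns produced by extraction_rows
# (class, text, attributes, start, end); values here are illustrative.
df = pd.DataFrame([
    {"class": "penalty", "text": "2% monthly penalty",
     "attributes": json.dumps({"category": "late_payment", "risk_level": "high"}),
     "start": 120, "end": 138},
    {"class": "party", "text": "Acme Corp",
     "attributes": json.dumps({"category": "supplier", "risk_level": "low"}),
     "start": 0, "end": 9},
])

# Decode the JSON attribute payloads into a column of dicts, then filter.
df["attrs"] = df["attributes"].apply(json.loads)
high_risk = df[df["attrs"].apply(lambda a: a.get("risk_level") == "high")]
```

Keeping attributes as JSON strings makes the DataFrame easy to serialize to CSV, while decoding on demand keeps filtering and analytics straightforward.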

contract_prompt = textwrap.dedent("""
Extract contract-risk information in order of appearance.


Rules:
1. Use exact text spans from the source. Do not paraphrase extraction_text.
2. Extract the following classes when present:
  - party
  - obligation
  - deadline
  - payment_term
  - penalty
  - termination_clause
  - governing_law
3. Add useful attributes:
  - party_name for obligations or payment terms when relevant
  - risk_level as low, medium, or high
  - category for the business meaning
4. Keep output grounded to the exact wording in the source.
5. Do not merge non-contiguous spans into one extraction.
""")


contract_examples = [
   lx.data.ExampleData(
       text=(
           "Acme Corp shall deliver the equipment by March 15, 2026. "
           "The Client must pay within 10 days of invoice receipt. "
           "Late payment incurs a 2% monthly penalty. "
           "This agreement is governed by the laws of Ontario."
       ),
       extractions=[
           lx.data.Extraction(
               extraction_class="party",
               extraction_text="Acme Corp",
               attributes={"category": "supplier", "risk_level": "low"}
           ),
           lx.data.Extraction(
               extraction_class="obligation",
               extraction_text="shall deliver the equipment",
               attributes={"party_name": "Acme Corp", "category": "delivery", "risk_level": "medium"}
           ),
           lx.data.Extraction(
               extraction_class="deadline",
               extraction_text="by March 15, 2026",
               attributes={"category": "delivery_deadline", "risk_level": "medium"}
           ),
           lx.data.Extraction(
               extraction_class="party",
               extraction_text="The Client",
               attributes={"category": "customer", "risk_level": "low"}
           ),
           lx.data.Extraction(
               extraction_class="payment_term",
               extraction_text="must pay within 10 days of invoice receipt",
               attributes={"party_name": "The Client", "category": "payment", "risk_level": "medium"}
           ),
           lx.data.Extraction(
               extraction_class="penalty",
               extraction_text="2% monthly penalty",
               attributes={"category": "late_payment", "risk_level": "high"}
           ),
           lx.data.Extraction(
               extraction_class="governing_law",
               extraction_text="laws of Ontario",
               attributes={"category": "legal_jurisdiction", "risk_level": "low"}
           ),
       ]
   )
]


contract_text = """
BluePeak Analytics shall provide a production-ready dashboard and underlying ETL pipeline no later than April 30, 2026.
North Ridge Manufacturing will remit payment within 7 calendar days after final acceptance.
If payment is delayed beyond 15 days, BluePeak Analytics may suspend support services and charge interest at 1.5% per month.
This Agreement shall be governed by the laws of British Columbia.
"""


contract_result, contract_jsonl, contract_html = run_extraction(
   text_or_documents=contract_text,
   prompt_description=contract_prompt,
   examples=contract_examples,
   output_stem="contract_risk_extraction",
   extraction_passes=2,
   max_workers=4,
   max_char_buffer=1400,
)


preview_result("USE CASE 1 — Contract risk extraction", contract_result, contract_html)

We build a contract intelligence extraction workflow by defining a detailed prompt and structured examples. We provide LangExtract with annotated training-style examples so that it understands how to identify entities such as obligations, deadlines, penalties, and governing laws. We then run the extraction pipeline on a contract text and preview the structured risk-related outputs.
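Because the prompt asks for risk_level as low, medium, or high, the extracted rows lend themselves to simple risk triage downstream. The sketch below (an assumed post-processing step, not part of LangExtract itself, using illustrative sample rows) maps the ordinal levels to scores so the highest-risk clauses surface first.

```python
import json
import pandas as pd

# Assumed ordinal mapping for the risk_level attribute values.
RISK_ORDER = {"low": 0, "medium": 1, "high": 2}

# Illustrative rows shaped like extraction_rows output for the contract case.
rows = pd.DataFrame([
    {"class": "payment_term", "attributes": json.dumps({"risk_level": "medium"})},
    {"class": "penalty", "attributes": json.dumps({"risk_level": "high"})},
    {"class": "governing_law", "attributes": json.dumps({"risk_level": "low"})},
])

# Score each extraction, defaulting to "low" when the attribute is absent.
rows["risk_score"] = rows["attributes"].apply(
    lambda s: RISK_ORDER.get(json.loads(s).get("risk_level", "low"), 0)
)
triaged = rows.sort_values("risk_score", ascending=False).reset_index(drop=True)
```

Sorting this way puts penalty and payment clauses ahead of boilerplate such as governing-law provisions, which is usually the review order a contract analyst wants.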

meeting_prompt = textwrap.dedent("""
Extract action items from meeting notes in order of appearance.


Rules:
1. Use exact text spans from the source. No paraphrasing in extraction_text.
2. Extract these classes when present:
  - assignee
  - action_item
  - due_date
  - blocker
  - decision
3. Add attributes:
  - priority as low, medium, or high
  - workstream when inferable from local context
  - owner for action_item when tied to a named assignee
4. Keep all spans grounded to the source text.
5. Preserve order of appearance.
""")


meeting_examples = [
   lx.data.ExampleData(
       text=(
           "Sarah will finalize the launch email by Friday. "
           "The team decided to postpone the webinar. "
           "Blocked by missing legal approval."
       ),
       extractions=[
           lx.data.Extraction(
               extraction_class="assignee",
               extraction_text="Sarah",
               attributes={"priority": "medium", "workstream": "marketing"}
           ),
           lx.data.Extraction(
               extraction_class="action_item",
               extraction_text="will finalize the launch email",
               attributes={"owner": "Sarah", "priority": "high", "workstream": "marketing"}
           ),
           lx.data.Extraction(
               extraction_class="due_date",
               extraction_text="by Friday",
               attributes={"priority": "medium", "workstream": "marketing"}
           ),
           lx.data.Extraction(
               extraction_class="decision",
               extraction_text="decided to postpone the webinar",
               attributes={"priority": "medium", "workstream": "events"}
           ),
           lx.data.Extraction(
               extraction_class="blocker",
               extraction_text="missing legal approval",
               attributes={"priority": "high", "workstream": "compliance"}
           ),
       ]
   )
]


meeting_text = """
Arjun will prepare the revised pricing sheet by Tuesday evening.
Mina to confirm the enterprise customer's data residency requirements this week.
The team agreed to ship the pilot only for the Oman region first.
Blocked by pending security review from the client's IT team.
Ravi will draft the rollback plan before the production cutover.
"""


meeting_result, meeting_jsonl, meeting_html = run_extraction(
   text_or_documents=meeting_text,
   prompt_description=meeting_prompt,
   examples=meeting_examples,
   output_stem="meeting_action_extraction",
   extraction_passes=2,
   max_workers=4,
   max_char_buffer=1400,
)


preview_result("USE CASE 2 — Meeting notes to action tracker", meeting_result, meeting_html)

We design a meeting intelligence extractor that focuses on action items, decisions, assignees, and blockers. We again provide example annotations to help the model structure meeting information consistently. We execute the extraction on meeting notes and display the resulting structured task tracker.
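The owner attribute on action_item extractions makes it easy to reduce the rows to an assignee-centric task list. A minimal sketch, again on illustrative sample rows rather than live API output, looks like this:

```python
import json
import pandas as pd

# Illustrative rows shaped like extraction_rows output for the meeting case.
rows = pd.DataFrame([
    {"class": "assignee", "text": "Sarah",
     "attributes": json.dumps({"priority": "medium"})},
    {"class": "action_item", "text": "will finalize the launch email",
     "attributes": json.dumps({"owner": "Sarah", "priority": "high"})},
    {"class": "blocker", "text": "missing legal approval",
     "attributes": json.dumps({"priority": "high"})},
])

# Keep only action items and lift owner/priority out of the JSON payload.
actions = rows[rows["class"] == "action_item"].copy()
actions["owner"] = actions["attributes"].apply(lambda s: json.loads(s).get("owner"))
actions["priority"] = actions["attributes"].apply(lambda s: json.loads(s).get("priority"))
tracker = actions[["owner", "text", "priority"]].reset_index(drop=True)
```

The resulting tracker table can feed a ticketing system or a simple per-owner digest.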

longdoc_prompt = textwrap.dedent("""
Extract product launch intelligence in order of appearance.


Rules:
1. Use exact text spans from the source.
2. Extract:
  - company
  - product
  - launch_date
  - region
  - metric
  - partnership
3. Add attributes:
  - category
  - significance as low, medium, or high
4. Keep the extraction grounded in the original text.
5. Do not paraphrase the extracted span.
""")


longdoc_examples = [
   lx.data.ExampleData(
       text=(
           "Nova Robotics launched Atlas Mini in Europe on 12 January 2026. "
           "The company reported 18% faster picking speed and partnered with Helix Warehousing."
       ),
       extractions=[
           lx.data.Extraction(
               extraction_class="company",
               extraction_text="Nova Robotics",
               attributes={"category": "vendor", "significance": "medium"}
           ),
           lx.data.Extraction(
               extraction_class="product",
               extraction_text="Atlas Mini",
               attributes={"category": "product_name", "significance": "high"}
           ),
           lx.data.Extraction(
               extraction_class="region",
               extraction_text="Europe",
               attributes={"category": "market", "significance": "medium"}
           ),
           lx.data.Extraction(
               extraction_class="launch_date",
               extraction_text="12 January 2026",
               attributes={"category": "timeline", "significance": "medium"}
           ),
           lx.data.Extraction(
               extraction_class="metric",
               extraction_text="18% faster picking speed",
               attributes={"category": "performance_claim", "significance": "high"}
           ),
           lx.data.Extraction(
               extraction_class="partnership",
               extraction_text="partnered with Helix Warehousing",
               attributes={"category": "go_to_market", "significance": "medium"}
           ),
       ]
   )
]


long_text = """
Vertex Dynamics launched FleetSense 3.0 for industrial logistics teams across the GCC on 5 February 2026.
The company said the release improves the accuracy of route deviation detection by 22% and reduces manual review time by 31%.
In the first rollout phase, the platform will support Oman and the United Arab Emirates.
Vertex Dynamics also partnered with Falcon Telematics to integrate live driver behavior events into the dashboard.


A week later, FleetSense 3.0 added a risk-scoring module for safety managers.
The update gives supervisors a daily ranked list of high-risk trips and exception events.
The company described the module as especially useful for oilfield transport operations and contractor fleet audits.


By late February 2026, the team announced a pilot with Desert Haul Services.
The pilot covers 240 heavy vehicles and focuses on speeding up incident triage, compliance review, and evidence retrieval.
Internal testing showed analysts could assemble review packets in under 8 minutes instead of the previous 20 minutes.
"""


longdoc_result, longdoc_jsonl, longdoc_html = run_extraction(
   text_or_documents=long_text,
   prompt_description=longdoc_prompt,
   examples=longdoc_examples,
   output_stem="long_document_extraction",
   extraction_passes=3,
   max_workers=8,
   max_char_buffer=1000,
)


preview_result("USE CASE 3 — Long-document extraction", longdoc_result, longdoc_html)


batch_docs = [
   """
   The supplier must replace defective batteries within 14 days of written notice.
   Any unresolved safety issue may trigger immediate suspension of shipments.
   """,
   """
   Priya will circulate the revised onboarding checklist tomorrow morning.
   The team approved the API deprecation plan for the legacy endpoint.
   """,
   """
   Orbit Health launched a remote triage assistant in Singapore on 14 March 2026.
   The company claims the assistant reduces nurse intake time by 17%.
   """
]


batch_prompt = textwrap.dedent("""
Extract operationally useful spans in order of appearance.


Allowed classes:
- obligation
- deadline
- penalty
- assignee
- action_item
- decision
- company
- product
- launch_date
- metric


Use exact text only and attach a simple attribute:
- source_type
""")


batch_examples = [
   lx.data.ExampleData(
       text="Jordan will submit the report by Monday. Late delivery incurs a service credit.",
       extractions=[
           lx.data.Extraction(
               extraction_class="assignee",
               extraction_text="Jordan",
               attributes={"source_type": "meeting"}
           ),
           lx.data.Extraction(
               extraction_class="action_item",
               extraction_text="will submit the report",
               attributes={"source_type": "meeting"}
           ),
           lx.data.Extraction(
               extraction_class="deadline",
               extraction_text="by Monday",
               attributes={"source_type": "meeting"}
           ),
           lx.data.Extraction(
               extraction_class="penalty",
               extraction_text="service credit",
               attributes={"source_type": "contract"}
           ),
       ]
   )
]


batch_results = []
for idx, doc in enumerate(batch_docs, start=1):
   res, jsonl_name, html_name = run_extraction(
       text_or_documents=doc,
       prompt_description=batch_prompt,
       examples=batch_examples,
       output_stem=f"batch_doc_{idx}",
       extraction_passes=2,
       max_workers=4,
       max_char_buffer=1200,
   )
   df = extraction_rows(res)
   df.insert(0, "document_id", idx)
   batch_results.append(df)
   print(f"Finished doc {idx} -> {html_name}")


batch_df = pd.concat(batch_results, ignore_index=True)
print("\nCombined batch output")
display(batch_df)


print("\nContract extraction counts by class")
display(
   extraction_rows(contract_result)
   .groupby("class", as_index=False)
   .size()
   .sort_values("size", ascending=False)
)


print("\nMeeting action items only")
meeting_df = extraction_rows(meeting_result)
display(meeting_df[meeting_df["class"] == "action_item"])


print("\nLong-document metrics only")
longdoc_df = extraction_rows(longdoc_result)
display(longdoc_df[longdoc_df["class"] == "metric"])


final_df = pd.concat([
   extraction_rows(contract_result).assign(use_case="contract_risk"),
   extraction_rows(meeting_result).assign(use_case="meeting_actions"),
   extraction_rows(longdoc_result).assign(use_case="long_document"),
], ignore_index=True)


final_df.to_csv("langextract_tutorial_outputs.csv", index=False)
print("\nSaved CSV: langextract_tutorial_outputs.csv")


print("\nGenerated files:")
for name in [
   contract_jsonl, contract_html,
   meeting_jsonl, meeting_html,
   longdoc_jsonl, longdoc_html,
   "langextract_tutorial_outputs.csv"
]:
   print(" -", name)

We implement a long-document intelligence pipeline capable of extracting structured insights from large narrative text. We run the extraction across product launch reports and operational documents, and also demonstrate batch processing across multiple documents. We then analyze the extracted results, filter key classes, and export the structured dataset to a CSV file for downstream analysis.
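Once the combined results are exported, the CSV becomes the natural interface for downstream analysis. As a sketch (using an in-memory buffer with illustrative rows so it runs without the files generated above), we can reload the export and count extractions per use case and class:

```python
import io
import pandas as pd

# Illustrative stand-in for langextract_tutorial_outputs.csv.
csv_text = """class,text,attributes,start,end,use_case
penalty,2% monthly penalty,{},120,138,contract_risk
action_item,will finalize the launch email,{},0,30,meeting_actions
metric,18% faster picking speed,{},50,74,long_document
metric,reduces manual review time by 31%,{},90,123,long_document
"""

final_df = pd.read_csv(io.StringIO(csv_text))

# Count extractions per (use_case, class), most frequent first.
summary = (
    final_df.groupby(["use_case", "class"], as_index=False)
    .size()
    .sort_values("size", ascending=False)
)
```

The same pattern works unchanged on the real export by swapping the buffer for `pd.read_csv("langextract_tutorial_outputs.csv")`.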

In conclusion, we built an advanced LangExtract workflow that converts complex text documents into structured datasets with traceable source grounding. We ran multiple extraction scenarios, including contract risk analysis, meeting action tracking, long-document intelligence extraction, and batch processing across multiple documents. We also visualized the extractions and exported the final structured results into a CSV file for further analysis. Through this process, we saw how prompt design, example-based extraction, and scalable processing strategies allow us to build robust information extraction systems with minimal code.




The post A Coding Guide to Build Advanced Document Intelligence Pipelines with Google LangExtract, OpenAI Models, Structured Extraction, and Interactive Visualization appeared first on MarkTechPost.
