AI & Machine Learning

A Coding Guide to Build Advanced Document Intelligence Pipelines with Google LangExtract, OpenAI Models, Structured Extraction, and Interactive Visualization

By NextTech | April 9, 2026 | 10 Mins Read


In this tutorial, we explore how to use Google's LangExtract library to transform unstructured text into structured, machine-readable information. We begin by installing the required dependencies and securely configuring our OpenAI API key to leverage powerful language models for extraction tasks. We then build a reusable extraction pipeline that lets us process a range of document types, including contracts, meeting notes, product announcements, and operational logs. Through carefully designed prompts and example annotations, we demonstrate how LangExtract can identify entities, actions, deadlines, risks, and other structured attributes while grounding them to their exact source spans. We also visualize the extracted information and organize it into tabular datasets, enabling downstream analytics, automation workflows, and decision-making systems.

!pip -q install -U "langextract[openai]" pandas IPython


import os
import json
import textwrap
import getpass
import pandas as pd


OPENAI_API_KEY = getpass.getpass("Enter OPENAI_API_KEY: ")
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY


import langextract as lx
from IPython.display import display, HTML

We install the required libraries, including LangExtract, Pandas, and IPython, so that our Colab environment is ready for structured extraction tasks. We securely request the OpenAI API key from the user and store it as an environment variable for safe access during runtime. We then import the core libraries needed to run LangExtract, display results, and handle structured outputs.
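On repeated runs, prompting for the key every time gets tedious. A small sketch (an optional convenience helper, not part of the LangExtract API; the function name is ours) reuses a key already present in the environment and only prompts when one is missing:

```python
import os
import getpass

def ensure_openai_key() -> str:
    # Reuse an existing key from the environment; prompt only when absent.
    key = os.environ.get("OPENAI_API_KEY", "")
    if not key:
        key = getpass.getpass("Enter OPENAI_API_KEY: ")
        os.environ["OPENAI_API_KEY"] = key
    return key
```

Calling `ensure_openai_key()` at the top of the notebook behaves identically to the explicit `getpass` prompt on a fresh runtime, but is a no-op when the key is already set.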

MODEL_ID = "gpt-4o-mini"


def run_extraction(
   text_or_documents,
   prompt_description,
   examples,
   output_stem,
   model_id=MODEL_ID,
   extraction_passes=1,
   max_workers=4,
   max_char_buffer=1800,
):
   result = lx.extract(
       text_or_documents=text_or_documents,
       prompt_description=prompt_description,
       examples=examples,
       model_id=model_id,
       api_key=os.environ["OPENAI_API_KEY"],
       fence_output=True,
       use_schema_constraints=False,
       extraction_passes=extraction_passes,
       max_workers=max_workers,
       max_char_buffer=max_char_buffer,
   )


   jsonl_name = f"{output_stem}.jsonl"
   html_name = f"{output_stem}.html"


   lx.io.save_annotated_documents([result], output_name=jsonl_name, output_dir=".")
   html_content = lx.visualize(jsonl_name)


   with open(html_name, "w", encoding="utf-8") as f:
       if hasattr(html_content, "data"):
           f.write(html_content.data)
       else:
           f.write(html_content)


   return result, jsonl_name, html_name


def extraction_rows(result):
   rows = []
   for ex in result.extractions:
       start_pos = None
       end_pos = None
       if getattr(ex, "char_interval", None):
           start_pos = ex.char_interval.start_pos
           end_pos = ex.char_interval.end_pos


       rows.append({
           "class": ex.extraction_class,
           "text": ex.extraction_text,
           "attributes": json.dumps(ex.attributes or {}, ensure_ascii=False),
           "start": start_pos,
           "end": end_pos,
       })
   return pd.DataFrame(rows)


def preview_result(title, result, html_name, max_rows=50):
   print("=" * 80)
   print(title)
   print("=" * 80)
   print(f"Total extractions: {len(result.extractions)}")
   df = extraction_rows(result)
   display(df.head(max_rows))
   display(HTML(f'<p>Open interactive visualization: <a href="{html_name}" target="_blank">{html_name}</a></p>'))

We define the core utility functions that power the entire extraction pipeline. We create a reusable run_extraction function that sends text to the LangExtract engine and generates both JSONL and HTML outputs. We also define helper functions to convert the extraction results into tabular rows and preview them interactively in the notebook.
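To see the shape of the frame without spending an API call, you can build rows by hand exactly the way extraction_rows does (the rows below are invented for illustration):

```python
import json
import pandas as pd

# Hand-written rows mirroring the five columns extraction_rows emits.
demo_rows = [
    {"class": "party", "text": "Acme Corp",
     "attributes": json.dumps({"category": "supplier"}, ensure_ascii=False),
     "start": 0, "end": 9},
    {"class": "deadline", "text": "by March 15, 2026",
     "attributes": json.dumps({"risk_level": "medium"}, ensure_ascii=False),
     "start": 38, "end": 55},
]
demo_df = pd.DataFrame(demo_rows)

# Attributes are stored as JSON strings so they survive a CSV round trip;
# decode them back when you want to filter on a field.
risk = demo_df["attributes"].apply(json.loads).apply(lambda a: a.get("risk_level"))
print(demo_df.loc[risk == "medium", "text"].tolist())  # ['by March 15, 2026']
```

Serializing attributes with json.dumps keeps the DataFrame flat, which is what makes the CSV export at the end of the tutorial lossless for nested fields.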

contract_prompt = textwrap.dedent("""
Extract contract-risk information in order of appearance.


Rules:
1. Use exact text spans from the source. Do not paraphrase extraction_text.
2. Extract the following classes when present:
  - party
  - obligation
  - deadline
  - payment_term
  - penalty
  - termination_clause
  - governing_law
3. Add useful attributes:
  - party_name for obligations or payment terms when relevant
  - risk_level as low, medium, or high
  - category for the business meaning
4. Keep output grounded to the exact wording in the source.
5. Do not merge non-contiguous spans into one extraction.
""")


contract_examples = [
   lx.data.ExampleData(
       text=(
           "Acme Corp shall deliver the equipment by March 15, 2026. "
           "The Client must pay within 10 days of invoice receipt. "
           "Late payment incurs a 2% monthly penalty. "
           "This agreement is governed by the laws of Ontario."
       ),
       extractions=[
           lx.data.Extraction(
               extraction_class="party",
               extraction_text="Acme Corp",
               attributes={"category": "supplier", "risk_level": "low"}
           ),
           lx.data.Extraction(
               extraction_class="obligation",
               extraction_text="shall deliver the equipment",
               attributes={"party_name": "Acme Corp", "category": "delivery", "risk_level": "medium"}
           ),
           lx.data.Extraction(
               extraction_class="deadline",
               extraction_text="by March 15, 2026",
               attributes={"category": "delivery_deadline", "risk_level": "medium"}
           ),
           lx.data.Extraction(
               extraction_class="party",
               extraction_text="The Client",
               attributes={"category": "customer", "risk_level": "low"}
           ),
           lx.data.Extraction(
               extraction_class="payment_term",
               extraction_text="must pay within 10 days of invoice receipt",
               attributes={"party_name": "The Client", "category": "payment", "risk_level": "medium"}
           ),
           lx.data.Extraction(
               extraction_class="penalty",
               extraction_text="2% monthly penalty",
               attributes={"category": "late_payment", "risk_level": "high"}
           ),
           lx.data.Extraction(
               extraction_class="governing_law",
               extraction_text="laws of Ontario",
               attributes={"category": "legal_jurisdiction", "risk_level": "low"}
           ),
       ]
   )
]


contract_text = """
BluePeak Analytics shall provide a production-ready dashboard and underlying ETL pipeline no later than April 30, 2026.
North Ridge Manufacturing will remit payment within 7 calendar days after final acceptance.
If payment is delayed beyond 15 days, BluePeak Analytics may suspend support services and charge interest at 1.5% per month.
This Agreement shall be governed by the laws of British Columbia.
"""


contract_result, contract_jsonl, contract_html = run_extraction(
   text_or_documents=contract_text,
   prompt_description=contract_prompt,
   examples=contract_examples,
   output_stem="contract_risk_extraction",
   extraction_passes=2,
   max_workers=4,
   max_char_buffer=1400,
)


preview_result("USE CASE 1 — Contract risk extraction", contract_result, contract_html)

We build a contract intelligence extraction workflow by defining a detailed prompt and structured examples. We give LangExtract annotated training-style examples so that it understands how to identify entities such as obligations, deadlines, penalties, and governing laws. We then run the extraction pipeline on a contract text and preview the structured risk-related outputs.
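Because risk_level lives inside the JSON attributes column, a short pandas sketch (with illustrative rows standing in for extraction_rows(contract_result)) can pull it out and triage the extractions:

```python
import json
import pandas as pd

# Illustrative rows; in the notebook these come from extraction_rows(contract_result).
risk_df = pd.DataFrame([
    {"class": "penalty", "attributes": json.dumps({"risk_level": "high"})},
    {"class": "deadline", "attributes": json.dumps({"risk_level": "medium"})},
    {"class": "payment_term", "attributes": json.dumps({"risk_level": "medium"})},
])

# Lift the risk level out of the serialized attributes, then group by it.
risk_df["risk_level"] = risk_df["attributes"].apply(lambda s: json.loads(s).get("risk_level"))
triage = risk_df.groupby("risk_level")["class"].apply(list).to_dict()
print(triage)  # {'high': ['penalty'], 'medium': ['deadline', 'payment_term']}
```

A triage table like this is often the first artifact a legal-review workflow wants: high-risk clauses surface immediately, with the grounded source spans still available in the full frame.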

meeting_prompt = textwrap.dedent("""
Extract action items from meeting notes in order of appearance.


Rules:
1. Use exact text spans from the source. No paraphrasing in extraction_text.
2. Extract these classes when present:
  - assignee
  - action_item
  - due_date
  - blocker
  - decision
3. Add attributes:
  - priority as low, medium, or high
  - workstream when inferable from local context
  - owner for action_item when tied to a named assignee
4. Keep all spans grounded to the source text.
5. Preserve order of appearance.
""")


meeting_examples = [
   lx.data.ExampleData(
       text=(
           "Sarah will finalize the launch email by Friday. "
           "The team decided to postpone the webinar. "
           "Blocked by missing legal approval."
       ),
       extractions=[
           lx.data.Extraction(
               extraction_class="assignee",
               extraction_text="Sarah",
               attributes={"priority": "medium", "workstream": "marketing"}
           ),
           lx.data.Extraction(
               extraction_class="action_item",
               extraction_text="will finalize the launch email",
               attributes={"owner": "Sarah", "priority": "high", "workstream": "marketing"}
           ),
           lx.data.Extraction(
               extraction_class="due_date",
               extraction_text="by Friday",
               attributes={"priority": "medium", "workstream": "marketing"}
           ),
           lx.data.Extraction(
               extraction_class="decision",
               extraction_text="decided to postpone the webinar",
               attributes={"priority": "medium", "workstream": "events"}
           ),
           lx.data.Extraction(
               extraction_class="blocker",
               extraction_text="missing legal approval",
               attributes={"priority": "high", "workstream": "compliance"}
           ),
       ]
   )
]


meeting_text = """
Arjun will prepare the revised pricing sheet by Tuesday evening.
Mina to confirm the enterprise customer's data residency requirements this week.
The team agreed to ship the pilot only for the Oman region first.
Blocked by pending security review from the client's IT team.
Ravi will draft the rollback plan before the production cutover.
"""


meeting_result, meeting_jsonl, meeting_html = run_extraction(
   text_or_documents=meeting_text,
   prompt_description=meeting_prompt,
   examples=meeting_examples,
   output_stem="meeting_action_extraction",
   extraction_passes=2,
   max_workers=4,
   max_char_buffer=1400,
)


preview_result("USE CASE 2 — Meeting notes to action tracker", meeting_result, meeting_html)

We design a meeting intelligence extractor that focuses on action items, decisions, assignees, and blockers. We again provide example annotations to help the model structure meeting information consistently. We execute the extraction on the meeting notes and display the resulting structured task tracker.
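The action-item rows become a usable tracker once the owner and priority attributes are lifted out of the JSON column. The sketch below uses hand-written rows in place of extraction_rows(meeting_result):

```python
import json
import pandas as pd

# Hand-written rows in place of real extraction output.
actions_df = pd.DataFrame([
    {"class": "action_item", "text": "will draft the rollback plan",
     "attributes": json.dumps({"owner": "Ravi", "priority": "medium"})},
    {"class": "action_item", "text": "will prepare the revised pricing sheet",
     "attributes": json.dumps({"owner": "Arjun", "priority": "high"})},
])

attrs = actions_df["attributes"].apply(json.loads)
actions_df["owner"] = attrs.apply(lambda a: a.get("owner"))
actions_df["priority"] = attrs.apply(lambda a: a.get("priority"))

# Surface high-priority items first.
rank = {"high": 0, "medium": 1, "low": 2}
tracker = actions_df.sort_values("priority", key=lambda s: s.map(rank))
print(tracker[["owner", "priority", "text"]].to_string(index=False))
```

Sorting through an explicit rank map rather than alphabetically is what keeps "high" ahead of "low" and "medium" in the tracker.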

longdoc_prompt = textwrap.dedent("""
Extract product launch intelligence in order of appearance.


Rules:
1. Use exact text spans from the source.
2. Extract:
  - company
  - product
  - launch_date
  - region
  - metric
  - partnership
3. Add attributes:
  - category
  - significance as low, medium, or high
4. Keep the extraction grounded in the original text.
5. Do not paraphrase the extracted span.
""")


longdoc_examples = [
   lx.data.ExampleData(
       text=(
           "Nova Robotics launched Atlas Mini in Europe on 12 January 2026. "
           "The company reported 18% faster picking speed and partnered with Helix Warehousing."
       ),
       extractions=[
           lx.data.Extraction(
               extraction_class="company",
               extraction_text="Nova Robotics",
               attributes={"category": "vendor", "significance": "medium"}
           ),
           lx.data.Extraction(
               extraction_class="product",
               extraction_text="Atlas Mini",
               attributes={"category": "product_name", "significance": "high"}
           ),
           lx.data.Extraction(
               extraction_class="region",
               extraction_text="Europe",
               attributes={"category": "market", "significance": "medium"}
           ),
           lx.data.Extraction(
               extraction_class="launch_date",
               extraction_text="12 January 2026",
               attributes={"category": "timeline", "significance": "medium"}
           ),
           lx.data.Extraction(
               extraction_class="metric",
               extraction_text="18% faster picking speed",
               attributes={"category": "performance_claim", "significance": "high"}
           ),
           lx.data.Extraction(
               extraction_class="partnership",
               extraction_text="partnered with Helix Warehousing",
               attributes={"category": "go_to_market", "significance": "medium"}
           ),
       ]
   )
]


long_text = """
Vertex Dynamics launched FleetSense 3.0 for industrial logistics teams across the GCC on 5 February 2026.
The company said the release improves the accuracy of route deviation detection by 22% and reduces manual review time by 31%.
In the first rollout phase, the platform will support Oman and the United Arab Emirates.
Vertex Dynamics also partnered with Falcon Telematics to integrate live driver behavior events into the dashboard.


A week later, FleetSense 3.0 added a risk-scoring module for safety managers.
The update gives supervisors a daily ranked list of high-risk trips and exception events.
The company described the module as especially valuable for oilfield transport operations and contractor fleet audits.


By late February 2026, the team announced a pilot with Desert Haul Services.
The pilot covers 240 heavy vehicles and focuses on speeding up incident triage, compliance review, and evidence retrieval.
Internal testing showed analysts could assemble review packets in under 8 minutes instead of the previous 20 minutes.
"""


longdoc_result, longdoc_jsonl, longdoc_html = run_extraction(
   text_or_documents=long_text,
   prompt_description=longdoc_prompt,
   examples=longdoc_examples,
   output_stem="long_document_extraction",
   extraction_passes=3,
   max_workers=8,
   max_char_buffer=1000,
)


preview_result("USE CASE 3 — Long-document extraction", longdoc_result, longdoc_html)


batch_docs = [
   """
   The supplier must replace defective batteries within 14 days of written notice.
   Any unresolved safety issue may trigger immediate suspension of shipments.
   """,
   """
   Priya will circulate the revised onboarding checklist tomorrow morning.
   The team approved the API deprecation plan for the legacy endpoint.
   """,
   """
   Orbit Health launched a remote triage assistant in Singapore on 14 March 2026.
   The company claims the assistant reduces nurse intake time by 17%.
   """
]


batch_prompt = textwrap.dedent("""
Extract operationally useful spans in order of appearance.


Allowed classes:
- obligation
- deadline
- penalty
- assignee
- action_item
- decision
- company
- product
- launch_date
- metric


Use exact text only and attach a simple attribute:
- source_type
""")


batch_examples = [
   lx.data.ExampleData(
       text="Jordan will submit the report by Monday. Late delivery incurs a service credit.",
       extractions=[
           lx.data.Extraction(
               extraction_class="assignee",
               extraction_text="Jordan",
               attributes={"source_type": "meeting"}
           ),
           lx.data.Extraction(
               extraction_class="action_item",
               extraction_text="will submit the report",
               attributes={"source_type": "meeting"}
           ),
           lx.data.Extraction(
               extraction_class="deadline",
               extraction_text="by Monday",
               attributes={"source_type": "meeting"}
           ),
           lx.data.Extraction(
               extraction_class="penalty",
               extraction_text="service credit",
               attributes={"source_type": "contract"}
           ),
       ]
   )
]


batch_results = []
for idx, doc in enumerate(batch_docs, start=1):
   res, jsonl_name, html_name = run_extraction(
       text_or_documents=doc,
       prompt_description=batch_prompt,
       examples=batch_examples,
       output_stem=f"batch_doc_{idx}",
       extraction_passes=2,
       max_workers=4,
       max_char_buffer=1200,
   )
   df = extraction_rows(res)
   df.insert(0, "document_id", idx)
   batch_results.append(df)
   print(f"Completed doc {idx} -> {html_name}")


batch_df = pd.concat(batch_results, ignore_index=True)
print("\nCombined batch output")
display(batch_df)


print("\nContract extraction counts by class")
display(
   extraction_rows(contract_result)
   .groupby("class", as_index=False)
   .size()
   .sort_values("size", ascending=False)
)


print("\nMeeting action items only")
meeting_df = extraction_rows(meeting_result)
display(meeting_df[meeting_df["class"] == "action_item"])


print("\nLong-document metrics only")
longdoc_df = extraction_rows(longdoc_result)
display(longdoc_df[longdoc_df["class"] == "metric"])


final_df = pd.concat([
   extraction_rows(contract_result).assign(use_case="contract_risk"),
   extraction_rows(meeting_result).assign(use_case="meeting_actions"),
   extraction_rows(longdoc_result).assign(use_case="long_document"),
], ignore_index=True)


final_df.to_csv("langextract_tutorial_outputs.csv", index=False)
print("\nSaved CSV: langextract_tutorial_outputs.csv")


print("\nGenerated files:")
for name in [
   contract_jsonl, contract_html,
   meeting_jsonl, meeting_html,
   longdoc_jsonl, longdoc_html,
   "langextract_tutorial_outputs.csv"
]:
   print(" -", name)

We implement a long-document intelligence pipeline capable of extracting structured insights from large narrative text. We run the extraction across product launch reports and operational documents, and also demonstrate batch processing across multiple documents. We then analyze the extracted results, filter key classes, and export the structured dataset to a CSV file for downstream analysis.
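Since the exported CSV is the hand-off point for downstream analysis, it is worth confirming the round trip. The sketch below builds a small stand-in frame (final_df in the notebook) and reads it back through an in-memory buffer:

```python
import io
import pandas as pd

# A small stand-in for the exported frame (final_df in the notebook).
export_df = pd.DataFrame({
    "class": ["penalty", "action_item", "metric"],
    "text": ["2% monthly penalty", "will draft the rollback plan",
             "18% faster picking speed"],
    "use_case": ["contract_risk", "meeting_actions", "long_document"],
})

# Write and re-read through an in-memory buffer, as the CSV file round trip would.
buf = io.StringIO()
export_df.to_csv(buf, index=False)
buf.seek(0)
reloaded = pd.read_csv(buf)

summary = reloaded.groupby("use_case")["class"].count().to_dict()
print(summary)  # {'contract_risk': 1, 'long_document': 1, 'meeting_actions': 1}
```

Because the attributes column is serialized JSON rather than nested objects, nothing is lost in the export; json.loads recovers the original dictionaries after reloading.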

In conclusion, we built an advanced LangExtract workflow that converts complex text documents into structured datasets with traceable source grounding. We ran multiple extraction scenarios, including contract risk analysis, meeting action tracking, long-document intelligence extraction, and batch processing across multiple documents. We also visualized the extractions and exported the final structured results into a CSV file for further analysis. Through this process, we saw how prompt design, example-based extraction, and scalable processing techniques let us build robust information extraction systems with minimal code.

