How to Build Knowledge Graph Generation Pipelines From Text With kg-gen, NetworkX Analytics, and Interactive Visualizations

In this tutorial, we generate knowledge graphs from plain text, conversations, and multiple source documents using kg-gen. We start by installing the required dependencies and configuring an LLM through LiteLLM, then extract entities, predicates, and relations from simple text. From there, we handle longer passages with chunking and clustering, combine knowledge graphs from different sources, visualize graph structures, and analyze them with NetworkX. By the end, we have a complete workflow that turns unstructured text into an interpretable, searchable, visual, and exportable knowledge graph.

import subprocess, sys
def pip_install(pkgs):
   subprocess.run([sys.executable, "-m", "pip", "install", "-q", *pkgs], check=True)
pip_install([
   "kg-gen",
   "networkx>=3.1",
   "pyvis",
   "matplotlib",
   "python-louvain",
])
import os, json, getpass, textwrap
from collections import Counter
from kg_gen import KGGen
import networkx as nx
from pyvis.network import Network
import matplotlib.pyplot as plt
from IPython.display import HTML, IFrame, display
MODEL    = "openai/gpt-4o-mini"
KEY_NAME = "OPENAI_API_KEY"
def fetch_key(name):
   # Prefer Colab secrets, then the environment, then an interactive prompt.
   try:
       from google.colab import userdata
       v = userdata.get(name)
       if v: return v
   except Exception:
       pass
   if os.environ.get(name):
       return os.environ[name]
   return getpass.getpass(f"Enter {name}: ")
os.environ[KEY_NAME] = fetch_key(KEY_NAME)
kg = KGGen(model=MODEL, temperature=0.0)
print(f"✓ KGGen initialized with model={MODEL}")

We begin by installing the libraries required for knowledge graph generation, graph analytics, and visualization. We then import the core packages, including kg-gen, NetworkX, PyVis, Matplotlib, and the display utilities for Colab. We also resolve the API key (from Colab secrets, then the environment, then an interactive prompt) and initialize KGGen with the chosen model so that we can start generating graphs from text.

print("\n" + "="*70 + "\n SECTION 1 — Basic extraction\n" + "="*70)
simple_text = (
   "Linda is Josh's mother. Ben is Josh's brother. "
   "Andrew is Josh's father. Josh studies at Stanford University."
)
g_basic = kg.generate(input_data=simple_text, context="Family relationships")
print("Entities :", g_basic.entities)
print("Edges    :", g_basic.edges)
print("Relations:")
for s, p, o in g_basic.relations:
   print(f"   ({s}) -[{p}]-> ({o})")
print("\n" + "="*70 + "\n SECTION 2 — Chunking + clustering on a longer passage\n" + "="*70)
big_text = textwrap.dedent("""
   Artificial intelligence (AI) is a field of computer science focused on
   building machines that can perform tasks requiring human-like intelligence.
   Machine learning is a subset of AI in which systems learn patterns from
   data rather than being explicitly programmed. Deep learning is a subset of
   machine learning that uses multi-layer neural networks. Neural nets, also
   called NNs, are inspired by the structure of the brain.
   The Transformer architecture was introduced by Google
   researchers in 2017. The Transformer architecture underlies modern large
   language models such as GPT, Claude and Gemini. OpenAI released GPT-3 in
   2020 and GPT-4 in 2023. Anthropic, founded in 2021 by former OpenAI
   researchers, develops the Claude family of assistants. Google DeepMind
   develops the Gemini family of models.
   Stanford University hosts the Stanford AI Lab (SAIL) and the STAIR Lab.
   Researchers at Stanford produced the KGGen library, which extracts
   knowledge graphs from plain text using language models. KGGen relies on
   DSPy for structured outputs and routes model calls through LiteLLM, which
   supports providers including OpenAI, Anthropic, Google and Ollama.
""").strip()
g_big = kg.generate(
   input_data=big_text,
   chunk_size=800,
   cluster=True,
   context="History and ecosystem of contemporary AI",
)
print(f"Entities ({len(g_big.entities)}): {sorted(g_big.entities)}")
print(f"Edges    ({len(g_big.edges)}): {sorted(g_big.edges)}")
print(f"Relations: {len(g_big.relations)}")
for s, p, o in list(g_big.relations)[:15]:
   print(f"   ({s}) -[{p}]-> ({o})")
ec = getattr(g_big, "entity_clusters", None) or {}
if ec:
   print("\nEntity clusters (canonical → synonyms):")
   for canon, syns in ec.items():
       print(f"   {canon}: {sorted(syns)}")

We first test kg-gen on a simple family-relationship example to extract entities, edges, and relations. We then move to a longer AI-focused passage, where chunking splits the text into manageable pieces and clustering merges similar entities or relation types. We print the extracted graph components and inspect the entity clusters to see how the model groups related concepts.
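
To make the chunking and clustering steps concrete, here is a minimal stdlib-only sketch. It is not kg-gen's implementation: `chunk_sentences` and `apply_clusters` are hypothetical helpers, and the sentence splitter is a naive regular expression. It only illustrates the general idea of packing whole sentences into chunks of at most `chunk_size` characters, and of rewriting triples so that every alias points to its canonical entity.

```python
import re

def chunk_sentences(text, chunk_size):
    """Greedily pack whole sentences into chunks of at most chunk_size characters.

    Hypothetical helper for illustration; a single sentence longer than
    chunk_size becomes its own oversized chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > chunk_size:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

def apply_clusters(relations, clusters):
    """Rewrite (subject, predicate, object) triples to use canonical entity names.

    clusters maps a canonical name to the set of aliases it absorbs.
    """
    alias_to_canon = {
        alias: canon for canon, aliases in clusters.items() for alias in aliases
    }
    return {
        (alias_to_canon.get(s, s), p, alias_to_canon.get(o, o))
        for s, p, o in relations
    }

text = (
    "Deep learning uses neural networks. Neural nets are also called NNs. "
    "They are inspired by the brain."
)
chunks = chunk_sentences(text, chunk_size=70)

relations = {("neural nets", "inspired by", "brain"), ("NNs", "inspired by", "brain")}
clusters = {"neural networks": {"neural nets", "NNs", "neural networks"}}
merged = apply_clusters(relations, clusters)

print(len(chunks), "chunks")
print(merged)
```

After clustering, the two alias triples collapse into one, which is why a clustered graph usually has fewer entities and relations than the raw extraction.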

print("\n" + "="*70 + "\n SECTION 3 — Conversation extraction\n" + "="*70)
messages = [
   {"role": "user", "content": "Who founded Anthropic?"},
   {"role": "assistant", "content": "Anthropic was founded in 2021 by Dario Amodei and Daniela Amodei, along with other former OpenAI researchers."},
   {"role": "user", "content": "And what is their main product?"},
   {"role": "assistant", "content": "Anthropic's main product is Claude, a family of large language model assistants."},
]
g_chat = kg.generate(input_data=messages)
print("Relations from conversation:")
for s, p, o in g_chat.relations:
   print(f"   ({s}) -[{p}]-> ({o})")
print("\n" + "="*70 + "\n SECTION 4 — Aggregating multiple sources\n" + "="*70)
src1 = "Linda is Joe's mother. Ben is Joe's brother."
src2 = "Andrew is Joseph's father. Judy is Andrew's sister. Joseph also goes by Joe."
g_a = kg.generate(input_data=src1)
g_b = kg.generate(input_data=src2)
# Merge both graphs, then cluster so that aliases can be resolved.
combined = kg.aggregate([g_a, g_b])
clustered_combined = kg.cluster(combined, context="Family relationships")
print("Entities after clustering:", clustered_combined.entities)
print("Relations after clustering:")
for r in clustered_combined.relations:
   print(f"   {r}")
if getattr(clustered_combined, "entity_clusters", None):
   print("Entity clusters:", dict(clustered_combined.entity_clusters))
print("\n" + "="*70 + "\n SECTION 5 — Built-in viz\n" + "="*70)
builtin_path = "kg_builtin.html"
try:
   KGGen.visualize(g_big, builtin_path, open_in_browser=False)
   print(f"Wrote {builtin_path}")
   display(IFrame(builtin_path, width="100%", height=520))
except Exception as e:
   print(f"Built-in visualize failed ({e}); we'll use the custom pyvis viz below.")

We use a conversation-style input to show how kg-gen extracts structured relations from user-assistant messages. We then generate separate graphs from multiple text sources, aggregate them, and apply clustering to resolve related entities such as “Joe” and “Joseph.” We also try the built-in visualization feature and display the generated HTML graph inside Colab.
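
The following sketch shows why clustering is run after aggregation. It assumes that aggregation behaves like a set union over entities, edge labels, and triples; that is an assumption about kg-gen, not something its API is shown to guarantee here. `MiniGraph` and `aggregate` are hypothetical stand-ins, not kg-gen classes.

```python
from dataclasses import dataclass, field

@dataclass
class MiniGraph:
    """Hypothetical stand-in for a kg-gen graph: sets of entities, edges, and triples."""
    entities: set = field(default_factory=set)
    edges: set = field(default_factory=set)
    relations: set = field(default_factory=set)

def aggregate(graphs):
    """Union the entities, edge labels, and triples of several graphs (assumed behavior)."""
    out = MiniGraph()
    for g in graphs:
        out.entities |= g.entities
        out.edges |= g.edges
        out.relations |= g.relations
    return out

g_a = MiniGraph({"Linda", "Joe"}, {"is mother of"}, {("Linda", "is mother of", "Joe")})
g_b = MiniGraph({"Andrew", "Joseph"}, {"is father of"}, {("Andrew", "is father of", "Joseph")})
combined = aggregate([g_a, g_b])

# A plain union keeps "Joe" and "Joseph" as separate nodes; merging them is the
# job of the clustering step that follows aggregation.
print(sorted(combined.entities))
```

In this toy case the union still contains both “Joe” and “Joseph”, so Linda and Andrew are not yet connected through a shared child until the aliases are merged.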

print("\n" + "="*70 + "\n SECTION 6 — NetworkX analytics\n" + "="*70)
def kg_to_networkx(graph):
   G = nx.MultiDiGraph()
   for e in graph.entities:
       G.add_node(e)
   for s, p, o in graph.relations:
       G.add_edge(s, o, label=p)
   return G
G = kg_to_networkx(g_big)
print(f"Nodes: {G.number_of_nodes()}   Edges: {G.number_of_edges()}")
H = nx.Graph(G)
deg_cent = nx.degree_centrality(H)
btw_cent = nx.betweenness_centrality(H)
pr_cent  = nx.pagerank(nx.DiGraph(G)) if G.number_of_edges() else {}
def top(d, k=8): return sorted(d.items(), key=lambda x: -x[1])[:k]
print("\nTop entities by degree centrality:")
for n, v in top(deg_cent): print(f"   {n:35s} {v:.3f}")
print("\nTop entities by betweenness:")
for n, v in top(btw_cent): print(f"   {n:35s} {v:.3f}")
print("\nTop entities by PageRank:")
for n, v in top(pr_cent):  print(f"   {n:35s} {v:.3f}")
try:
   from networkx.algorithms.community import louvain_communities
   communities = louvain_communities(H, seed=42)
except Exception:
   # Fall back to python-louvain, which installs as the `community` module.
   import community as community_louvain
   parts = community_louvain.best_partition(H, random_state=42)
   bins = {}
   for n, c in parts.items(): bins.setdefault(c, set()).add(n)
   communities = list(bins.values())
print(f"\nDetected {len(communities)} communities:")
for i, c in enumerate(communities):
   print(f"   Community {i}: {sorted(c)}")
# Relations are (subject, predicate, object), so the predicate is the middle item.
pred_counts = Counter(p for _, p, _ in g_big.relations)
print("\nMost frequent predicates:")
for p, n in pred_counts.most_common(10):
   print(f"   {n:3d}  {p}")
print("\n" + "="*70 + "\n SECTION 7 — Custom pyvis viz\n" + "="*70)
palette = ["#e6194B","#3cb44b","#ffe119","#4363d8","#f58231",
          "#911eb4","#42d4f4","#f032e6","#bfef45","#fabed4"]
node_color = {}
for i, c in enumerate(communities):
   for n in c: node_color[n] = palette[i % len(palette)]
net = Network(height="600px", width="100%", directed=True,
             bgcolor="#ffffff", font_color="#222222",
             notebook=True, cdn_resources="in_line")
net.barnes_hut(gravity=-12000, spring_length=180)
for n in G.nodes:
   # Scale node size by PageRank so central entities stand out.
   size = 12 + 80 * pr_cent.get(n, 0.01)
   net.add_node(n, label=n, color=node_color.get(n, "#888888"),
                size=size, title=f"PageRank: {pr_cent.get(n,0):.3f}")
for s, o, data in G.edges(data=True):
   net.add_edge(s, o, label=data.get("label", ""), arrows="to")
pyvis_path = "kg_pyvis.html"
net.write_html(pyvis_path, notebook=False, open_browser=False)
print(f"Wrote {pyvis_path}")
display(IFrame(pyvis_path, width="100%", height=620))

We convert the generated knowledge graph into a NetworkX graph to enable deeper graph analytics. We calculate degree centrality, betweenness centrality, PageRank, predicate frequency, and community structure to identify important entities and relation patterns. We then create a custom PyVis visualization where nodes are sized by PageRank and colored by detected community.
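
As a rough illustration of what the degree centrality scores mean, the sketch below recomputes them by hand from a list of triples. It is a simplified stand-in written with the standard library only: `degree_centrality` here is a hypothetical helper, the triples are made up for the example, and self-loops are assumed not to occur. Like `nx.Graph(G)` in the tutorial code, it ignores edge direction and collapses parallel edges between the same pair of nodes.

```python
from collections import defaultdict

def degree_centrality(triples):
    """Degree centrality on the undirected simple graph induced by the triples.

    Each node's score is (number of distinct neighbors) / (n - 1), where n is
    the number of nodes. Assumes no triple has the same subject and object.
    """
    neighbors = defaultdict(set)
    for s, _, o in triples:
        neighbors[s].add(o)
        neighbors[o].add(s)
    n = len(neighbors)
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}

# Illustrative triples only; the last one duplicates the OpenAI/GPT-4 pair
# with a different predicate to show that parallel edges count once.
triples = [
    ("OpenAI", "released", "GPT-3"),
    ("OpenAI", "released", "GPT-4"),
    ("Anthropic", "develops", "Claude"),
    ("Anthropic", "founded by former researchers of", "OpenAI"),
    ("OpenAI", "develops", "GPT-4"),
]
scores = degree_centrality(triples)
for node, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{node:12s} {score:.2f}")
```

With five nodes, a node connected to three others scores 3/4, so the entity that touches the most distinct neighbors ranks first regardless of how many predicates link each pair.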

print("\n" + "="*70 + "\n SECTION 8 — KG-grounded lookup\n" + "="*70)
def lookup(graph, query):
   # Case-insensitive substring match against subject, predicate, and object.
   q = query.lower()
   hits = [(s,p,o) for s,p,o in graph.relations
           if q in s.lower() or q in p.lower() or q in o.lower()]
   return hits
for q in ["transformer", "Anthropic", "Stanford"]:
   print(f"\nQ: tell me about '{q}'")
   for s,p,o in lookup(g_big, q):
       print(f"   ({s}) -[{p}]-> ({o})")
def neighbors(G, node, hops=1):
   if node not in G: return set()
   return set(nx.single_source_shortest_path_length(G.to_undirected(), node, cutoff=hops))
print("\n2-hop neighborhood of 'machine learning':")
nb = neighbors(G, "machine learning", hops=2) if "machine learning" in G else set()
print("   ", sorted(nb))
print("\n" + "="*70 + "\n SECTION 9 — Export\n" + "="*70)
def graph_to_dict(graph):
   # Sets are not JSON-serializable, so convert everything to sorted lists.
   return {
       "entities": sorted(graph.entities),
       "edges":    sorted(graph.edges),
       "relations":[list(r) for r in graph.relations],
       "entity_clusters": {k: sorted(v) for k,v in (getattr(graph,"entity_clusters",None) or {}).items()},
       "edge_clusters":   {k: sorted(v) for k,v in (getattr(graph,"edge_clusters",None)   or {}).items()},
   }
with open("kg.json", "w") as f:
   json.dump(graph_to_dict(g_big), f, indent=2)
G_simple = nx.DiGraph()
for s,o,data in G.edges(data=True):
   # GraphML export uses a simple DiGraph, so join parallel edge labels.
   if G_simple.has_edge(s,o):
       G_simple[s][o]["label"] += " | " + data["label"]
   else:
       G_simple.add_edge(s,o,label=data["label"])
nx.write_graphml(G_simple, "kg.graphml")
print("Wrote: kg.json, kg.graphml, kg_builtin.html, kg_pyvis.html")
print("\n Tutorial complete.")

We build a simple knowledge graph lookup function that retrieves relations linked to a query term, such as “transformer,” “Anthropic,” or “Stanford.” We also inspect the two-hop neighborhood of an entity to see which concepts sit nearby in the graph. Finally, we export the knowledge graph as JSON and GraphML for reuse in tools such as Gephi and Cytoscape.
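
The two-hop neighborhood can also be understood as a breadth-first search with a depth limit. The sketch below is a stdlib-only approximation of that idea, not the NetworkX implementation: `k_hop` is a hypothetical helper and the triples are illustrative. As in the tutorial's `neighbors` function, edge direction is ignored and the start node itself is included in the result.

```python
from collections import deque

def k_hop(triples, start, hops):
    """Return every node within `hops` undirected steps of `start`, including `start`."""
    adjacency = {}
    for s, _, o in triples:
        adjacency.setdefault(s, set()).add(o)
        adjacency.setdefault(o, set()).add(s)
    if start not in adjacency:
        return set()
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == hops:
            continue  # do not expand past the hop limit
        for nxt in adjacency[node]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return seen

# Illustrative triples only.
triples = [
    ("deep learning", "subset of", "machine learning"),
    ("machine learning", "subset of", "AI"),
    ("deep learning", "uses", "neural networks"),
    ("neural networks", "inspired by", "brain"),
]
one_hop = k_hop(triples, "machine learning", hops=1)
two_hop = k_hop(triples, "machine learning", hops=2)
print(sorted(one_hop))
print(sorted(two_hop))
```

Raising `hops` widens the context returned for an entity, which is useful for lookup but grows quickly on densely connected graphs.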

In conclusion, we built a full knowledge graph generation pipeline that moves from basic extraction to graph analysis and visualization. We used kg-gen to identify entities and relationships, applied clustering to merge similar concepts, aggregated graphs from multiple inputs, and converted the result into NetworkX for centrality, PageRank, community detection, and predicate analysis. We also created interactive visualizations with PyVis, implemented a simple KG-grounded lookup, and exported the final graph as JSON and GraphML. Along the way, we saw how knowledge graphs turn raw text into structured information that is easier to explore, analyze, and reuse.

The post How to Build Knowledge Graph Generation Pipelines From Text With kg-gen, NetworkX Analytics, and Interactive Visualizations appeared first on MarkTechPost.