Crawlee for Python: Build a Web Crawling Pipeline with Robots Handling, Link Graphs, and RAG Chunk Export
In this tutorial, we build a complete Crawlee-for-Python workflow that covers environment setup, local website generation, static crawling, dynamic crawling, structured extraction, and downstream data processing. We start by configuring a compatible Crawlee runtime with pinned Pydantic support, Playwright browser installation, persistent storage directories, and Colab-safe execution handling. We then generate a realistic local demo website containing product pages, documentation pages, blog content, internal links, robots.txt rules, JSON-LD metadata, and JavaScript-rendered catalog items. Using BeautifulSoupCrawler, we perform fast recursive HTML crawling and extract page titles, metadata, text previews, outgoing links, product attributes, documentation headings, code blocks, and blog tags. With ParselCrawler, we run precise CSS- and XPath-based extraction on product detail pages. With PlaywrightCrawler, we render JavaScript content in a headless Chromium browser, wait for dynamic DOM elements to appear, extract client-side data, and capture full-page screenshots.
Setting Up the Crawlee Python Runtime and Helpers
import os
import sys
import re
import csv
import json
import time
import math
import shutil
import socket
import hashlib
import asyncio
import textwrap
import subprocess
import threading
from pathlib import Path
from functools import partial
from http.server import ThreadingHTTPServer, SimpleHTTPRequestHandler
from importlib.metadata import version, PackageNotFoundError

# Marker file that records whether the one-time dependency setup already ran.
SETUP_SENTINEL = "/content/.crawlee_python_tutorial_setup_done_v2"


def sh(command, check=True, quiet=False):
    """Run a shell command, echo the tail of its output, and return True on success."""
    print(f"\n$ {command}")
    result = subprocess.run(
        command,
        shell=True,
        text=True,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
    )
    if not quiet and result.stdout:
        print(result.stdout[-5000:])
    if check and result.returncode != 0:
        raise RuntimeError(f"Command failed with exit code {result.returncode}: {command}")
    return result.returncode == 0


def package_version(package_name):
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


def is_good_pydantic_version(v):
    """Accept only Pydantic 2.11.x, the range pinned during setup."""
    if not v:
        return False
    m = re.match(r"^(\d+)\.(\d+)", v)
    if not m:
        return False
    major, minor = int(m.group(1)), int(m.group(2))
    return major == 2 and minor == 11
current_crawlee = package_version("crawlee")
current_pydantic = package_version("pydantic")
needs_setup = (
not os.path.exists(SETUP_SENTINEL)
or current_crawlee is None
or not is_good_pydantic_version(current_pydantic)
)
if needs_setup:
    print("PHASE 1: Installing compatible Crawlee + Pydantic + Playwright dependencies.")
    print("After this finishes, Colab will restart automatically. Then run this same cell again.")
    sh(f'{sys.executable} -m pip uninstall -y crawlee pydantic pydantic-core', check=False)
    sh(
        f'{sys.executable} -m pip install -q -U '
        f'"pydantic>=2.11,<2.12" '
        f'"crawlee[all]" '
        f'pandas matplotlib networkx nest_asyncio beautifulsoup4 parsel'
    )
    sh(f'{sys.executable} -m playwright install --with-deps chromium', check=False)
    Path(SETUP_SENTINEL).write_text("done", encoding="utf-8")
    print("\nInstalled versions:")
    sh(f'{sys.executable} -m pip show crawlee pydantic pydantic-core', check=False)
    try:
        import google.colab
        print("\nRestarting Colab runtime now. After it reconnects, run this same cell again.")
        # Killing the process forces Colab to start a fresh runtime with the new packages.
        os.kill(os.getpid(), 9)
    except Exception:
        raise SystemExit("Setup complete. Restart the runtime/kernel manually, then run this cell again.")

print("PHASE 2: Dependencies are ready. Running the Crawlee tutorial.")
import pandas as pd
import matplotlib.pyplot as plt
import networkx as nx
import nest_asyncio
nest_asyncio.apply()
TUTORIAL_ROOT = Path("/content/crawlee_python_advanced_tutorial")
SITE_DIR = TUTORIAL_ROOT / "demo_site"
OUTPUT_DIR = TUTORIAL_ROOT / "outputs"
STORAGE_DIR = TUTORIAL_ROOT / "crawlee_storage"
SCREENSHOT_DIR = OUTPUT_DIR / "screenshots"
# Start from a clean slate on every run, then recreate the working directories.
for path in [SITE_DIR, OUTPUT_DIR, STORAGE_DIR]:
    if path.exists():
        shutil.rmtree(path)
for path in [SITE_DIR, OUTPUT_DIR, STORAGE_DIR, SCREENSHOT_DIR]:
    path.mkdir(parents=True, exist_ok=True)
os.environ["CRAWLEE_STORAGE_DIR"] = str(STORAGE_DIR)
os.environ["CRAWLEE_LOG_LEVEL"] = "INFO"
os.environ["CRAWLEE_PURGE_ON_START"] = "true"
from crawlee import Glob, ConcurrencySettings
from crawlee.crawlers import (
BeautifulSoupCrawler,
BeautifulSoupCrawlingContext,
ParselCrawler,
ParselCrawlingContext,
PlaywrightCrawler,
PlaywrightCrawlingContext,
)
try:
    import crawlee
    print("Crawlee version:", crawlee.__version__)
except Exception:
    print("Crawlee imported successfully.")
print("Pydantic version:", package_version("pydantic"))


def safe_slug(value):
    """Lowercase a value and collapse non-alphanumeric runs into hyphens."""
    value = re.sub(r"[^a-zA-Z0-9]+", "-", str(value)).strip("-").lower()
    return value or "item"


def money_to_float(value):
    """Strip currency symbols and separators, returning a float or None."""
    if value is None:
        return None
    cleaned = re.sub(r"[^0-9.]", "", str(value))
    return float(cleaned) if cleaned else None


def normalize_text(value, max_len=None):
    """Collapse whitespace and optionally truncate to max_len characters."""
    value = re.sub(r"\s+", " ", value or "").strip()
    return value[:max_len] if max_len else value


def write_file(path, content):
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(textwrap.dedent(content).strip() + "\n", encoding="utf-8")
We begin by preparing the complete Colab runtime for the Crawlee tutorial. We install compatible versions of Crawlee, Pydantic, Playwright, and the required analysis libraries, and handle the automatic restart required after setup. We then configure storage folders, environment variables, crawler imports, and helper functions so that the rest of the workflow runs smoothly.
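The version gate that decides whether setup must run again can be exercised on its own. The following is a minimal sketch using only the standard library; the helper names `parse_major_minor` and `is_pinned` are illustrative and do not appear in the tutorial code, which uses `is_good_pydantic_version` for the same purpose.

```python
import re


def parse_major_minor(version_string):
    """Return (major, minor) parsed from a version string, or None if it does not parse."""
    match = re.match(r"^(\d+)\.(\d+)", version_string or "")
    if not match:
        return None
    return int(match.group(1)), int(match.group(2))


def is_pinned(version_string, wanted=(2, 11)):
    """True only when the installed version falls in the wanted major.minor series."""
    return parse_major_minor(version_string) == wanted


# A missing package (None) or any other series counts as "needs setup".
for candidate in ["2.11.7", "2.12.0", "1.10.13", None]:
    print(candidate, "->", is_pinned(candidate))
```

Comparing a parsed tuple rather than raw strings avoids false matches such as "2.110" being mistaken for "2.11".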
Generating the Demo Website and Product Catalog
PRODUCTS = [
    {
        "sku": "CRW-101",
        "name": "Crawler Reliability Kit",
        "category": "automation",
        "price": 149.0,
        "rating": 4.8,
        "stock": 18,
        "features": ["retry policy", "queue replay", "structured logs"],
        "related": ["CRW-202", "CRW-303"],
    },
    {
        "sku": "CRW-202",
        "name": "Playwright Rendering Pack",
        "category": "browser",
        "price": 249.0,
        "rating": 4.7,
        "stock": 9,
        "features": ["headless chromium", "screenshots", "dynamic DOM extraction"],
        "related": ["CRW-101", "CRW-404"],
    },
    {
        "sku": "CRW-303",
        "name": "RAG Extraction Bundle",
        "category": "ai-data",
        "price": 199.0,
        "rating": 4.9,
        "stock": 13,
        "features": ["clean text chunks", "metadata capture", "JSONL export"],
        "related": ["CRW-101", "CRW-505"],
    },
    {
        "sku": "CRW-404",
        "name": "Anti-Fragile Session Toolkit",
        "category": "resilience",
        "price": 299.0,
        "rating": 4.6,
        "stock": 5,
        "features": ["session rotation", "state recovery", "graceful failures"],
        "related": ["CRW-202", "CRW-505"],
    },
    {
        "sku": "CRW-505",
        "name": "Data Export Control Plane",
        "category": "storage",
        "price": 179.0,
        "rating": 4.5,
        "stock": 21,
        "features": ["datasets", "key-value store", "CSV and JSON export"],
        "related": ["CRW-303", "CRW-404"],
    },
]
def layout(title, body, extra_head="", extra_script=""):
    """Wrap page-specific HTML in the shared header, navigation, styles, and footer."""
    css = """
    <style>
      body {
        font-family: Inter, system-ui, -apple-system, BlinkMacSystemFont, "Segoe UI", sans-serif;
        margin: 0;
        background: #f7f7fb;
        color: #1f2430;
      }
      header {
        background: #202638;
        color: white;
        padding: 28px 40px;
      }
      nav a {
        color: #dbe7ff;
        margin-right: 18px;
        text-decoration: none;
        font-weight: 600;
      }
      main {
        max-width: 1050px;
        margin: 0 auto;
        padding: 32px;
      }
      .grid {
        display: grid;
        grid-template-columns: repeat(auto-fit, minmax(230px, 1fr));
        gap: 18px;
      }
      .card, article, .panel {
        background: white;
        border: 1px solid #e5e7ef;
        border-radius: 16px;
        padding: 20px;
        box-shadow: 0 8px 25px rgba(20, 30, 60, 0.05);
      }
      .price {
        font-size: 1.3rem;
        font-weight: 800;
      }
      .tag {
        display: inline-block;
        background: #edf2ff;
        border: 1px solid #d6e0ff;
        border-radius: 999px;
        padding: 4px 10px;
        margin: 3px;
        font-size: 0.82rem;
      }
      .stock-low {
        color: #b42318;
        font-weight: 700;
      }
      .stock-ok {
        color: #067647;
        font-weight: 700;
      }
      code, pre {
        background: #111827;
        color: #d1fae5;
        border-radius: 10px;
      }
      pre {
        padding: 16px;
        overflow-x: auto;
      }
      footer {
        padding: 30px 40px;
        color: #606779;
      }
    </style>
    """
    return f"""
    <!doctype html>
    <html lang="en">
    <head>
      <meta charset="utf-8">
      <meta name="viewport" content="width=device-width, initial-scale=1">
      <meta name="description" content="{title} page for a Crawlee Python tutorial demo website.">
      <title>{title}</title>
      {css}
      {extra_head}
    </head>
    <body>
      <header>
        <h1>{title}</h1>
        <nav>
          <a href="/index.html">Home</a>
          <a href="/products/product-crw-101.html">Products</a>
          <a href="/docs/getting-started.html">Docs</a>
          <a href="/blog/crawling-at-scale.html">Blog</a>
          <a href="/dynamic.html">Dynamic JS Page</a>
          <a href="/admin/hidden.html">Admin</a>
        </nav>
      </header>
      <main>{body}</main>
      <footer>Local demo website generated for Crawlee Python advanced tutorial.</footer>
      {extra_script}
    </body>
    </html>
    """
def build_demo_site():
    # robots.txt blocks the admin area so the crawler's robots handling can be observed.
    write_file(
        SITE_DIR / "robots.txt",
        """
        User-agent: *
        Disallow: /admin/
        Allow: /
        """,
    )
    product_cards = []
    for product in PRODUCTS:
        product_cards.append(
            f"""
            <div class="card product-teaser" data-sku="{product['sku']}" data-category="{product['category']}">
              <h2><a href="/products/product-{safe_slug(product['sku'])}.html">{product['name']}</a></h2>
              <p>{product['category']} crawler module with rating {product['rating']}.</p>
              <p class="price" data-price="{product['price']}">${product['price']:.2f}</p>
              <p class="{'stock-low' if product['stock'] < 10 else 'stock-ok'}">Stock: {product['stock']}</p>
            </div>
            """
        )
    write_file(
        SITE_DIR / "index.html",
        layout(
            "Crawlee Demo Commerce + Docs Hub",
            f"""
            <section class="panel">
              <h2>Why this site exists</h2>
              <p>
                This local website gives us predictable pages for testing Crawlee without scraping a third-party website.
                We include static HTML pages, documentation pages, product detail pages, a blog article, robots.txt,
                and a JavaScript-rendered page.
              </p>
            </section>
            <h2>Featured crawler modules</h2>
            <section class="grid">
              {''.join(product_cards)}
            </section>
            <section class="panel">
              <h2>Internal links for recursive crawling</h2>
              <ul>
                <li><a href="/docs/getting-started.html">Getting started guide</a></li>
                <li><a href="/docs/advanced-routing.html">Advanced routing guide</a></li>
                <li><a href="/blog/crawling-at-scale.html">Crawling at scale article</a></li>
                <li><a href="/dynamic.html">JavaScript-rendered catalog</a></li>
                <li><a href="/admin/hidden.html">Admin page blocked by robots and crawler filters</a></li>
              </ul>
            </section>
            """,
        ),
    )
    for product in PRODUCTS:
        related_links = "\n".join(
            f'<li><a class="related-link" href="/products/product-{safe_slug(sku)}.html">{sku}</a></li>'
            for sku in product["related"]
        )
        feature_list = "\n".join(f"<li>{feature}</li>" for feature in product["features"])
        # Schema.org Product metadata embedded in each product page as JSON-LD.
        json_ld = json.dumps(
            {
                "@context": "https://schema.org",
                "@type": "Product",
                "sku": product["sku"],
                "name": product["name"],
                "category": product["category"],
                "offers": {
                    "@type": "Offer",
                    "price": product["price"],
                    "priceCurrency": "USD",
                },
                "aggregateRating": {
                    "@type": "AggregateRating",
                    "ratingValue": product["rating"],
                },
            },
            indent=2,
        )
        write_file(
            SITE_DIR / "products" / f"product-{safe_slug(product['sku'])}.html",
            layout(
                f"{product['name']} | Product Detail",
                f"""
                <article class="product"
                  data-sku="{product['sku']}"
                  data-category="{product['category']}"
                  data-rating="{product['rating']}"
                  data-stock="{product['stock']}">
                  <h2 class="product-title">{product['name']}</h2>
                  <p class="sku">SKU: <strong>{product['sku']}</strong></p>
                  <p class="category">Category: <strong>{product['category']}</strong></p>
                  <p class="price" data-price="{product['price']}">${product['price']:.2f}</p>
                  <p class="rating">Rating: {product['rating']} / 5</p>
                  <p class="{'stock-low' if product['stock'] < 10 else 'stock-ok'}">Stock: {product['stock']}</p>
                  <h3>Features</h3>
                  <ul class="features">{feature_list}</ul>
                  <h3>Related modules</h3>
                  <ul>{related_links}</ul>
                </article>
                <script type="application/ld+json">{json_ld}</script>
                """,
            ),
        )
We create a realistic product catalog that becomes the structured data source for our demo website. We define reusable HTML layout logic, styling, navigation, and page templates to make the local site look and behave like a small commerce and documentation portal. We then generate the homepage and product detail pages, including prices, ratings, stock levels, product features, related links, and JSON-LD metadata.
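To see how the JSON-LD metadata round-trips, here is a small standard-library sketch that embeds a product record in a `<script type="application/ld+json">` tag and reads it back. The `JsonLdParser` class and the sample `product` record are illustrative assumptions; the tutorial itself reads these blocks with BeautifulSoup.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collect the decoded contents of every application/ld+json script tag."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            raw = "".join(self._buffer).strip()
            if raw:
                self.blocks.append(json.loads(raw))


# Hypothetical record shaped like the product pages' metadata.
product = {
    "@context": "https://schema.org",
    "@type": "Product",
    "sku": "CRW-101",
    "name": "Crawler Reliability Kit",
    "offers": {"@type": "Offer", "price": 149.0, "priceCurrency": "USD"},
}
html = (
    "<html><body><h2>Crawler Reliability Kit</h2>"
    f'<script type="application/ld+json">{json.dumps(product)}</script>'
    "</body></html>"
)

parser = JsonLdParser()
parser.feed(html)
parser.close()
print(parser.blocks[0]["offers"]["price"])
```

Reading the price from JSON-LD gives a typed number directly, which is often more robust than parsing a formatted price string out of visible markup.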
Adding Docs, Blog, Dynamic, and Admin Pages
    write_file(
        SITE_DIR / "docs" / "getting-started.html",
        layout(
            "Getting Started with Reliable Crawlers",
            """
            <article class="doc" data-doc-id="getting-started">
              <h2>HTTP-first crawling strategy</h2>
              <p>
                We start with HTTP crawlers because they are lightweight and efficient.
                Browser crawling is reserved for pages that need JavaScript rendering.
              </p>
              <h2>Core extraction fields</h2>
              <p>
                Each crawler extracts URL, title, page type, text summary, outgoing links, and page-specific metadata.
              </p>
              <pre><code>crawler = BeautifulSoupCrawler(max_requests_per_crawl=20)</code></pre>
              <p><a href="/docs/advanced-routing.html">Next: advanced routing</a></p>
            </article>
            """,
        ),
    )
    write_file(
        SITE_DIR / "docs" / "advanced-routing.html",
        layout(
            "Advanced Routing and Storage",
            """
            <article class="doc" data-doc-id="advanced-routing">
              <h2>Queue filtering</h2>
              <p>
                We filter links to keep the crawl focused on the same local domain and skip admin pages.
              </p>
              <h2>Storage design</h2>
              <p>
                Structured rows go to datasets. Binary screenshots and snapshots go to a key-value store.
              </p>
              <pre><code>await context.enqueue_links(include=[Glob("https://example.com/**")])</code></pre>
              <p><a href="/blog/crawling-at-scale.html">Read the scaling article</a></p>
            </article>
            """,
        ),
    )
    write_file(
        SITE_DIR / "blog" / "crawling-at-scale.html",
        layout(
            "Crawling at Scale",
            """
            <article class="blog-post" data-author="demo-team" data-reading-time="7">
              <h2>Scaling crawler jobs without losing reliability</h2>
              <p>
                Production crawlers need controlled concurrency, retry behavior, stable request queues,
                structured exports, and monitoring-ready output.
              </p>
              <p>
                For AI data workflows, we also normalize text, preserve source URLs, create chunks,
                and record extraction provenance.
              </p>
              <span class="tag">queues</span>
              <span class="tag">datasets</span>
              <span class="tag">rag</span>
              <span class="tag">playwright</span>
            </article>
            """,
        ),
    )
    # Items that exist only in JavaScript; they reach the DOM after renderItems() runs.
    dynamic_items = json.dumps(
        [
            {
                "sku": "JS-900",
                "name": "Dynamic Inventory Scanner",
                "price": 329.0,
                "stock": 4,
                "desc": "Rendered only after JavaScript executes.",
            },
            {
                "sku": "JS-901",
                "name": "Client-Side Review Miner",
                "price": 279.0,
                "stock": 11,
                "desc": "Created by browser-side DOM manipulation.",
            },
            {
                "sku": "JS-902",
                "name": "Async Catalog Watcher",
                "price": 389.0,
                "stock": 7,
                "desc": "Useful for testing PlaywrightCrawler extraction.",
            },
        ],
        indent=2,
    )
    dynamic_script = f"""
    <script>
      const dynamicItems = {dynamic_items};
      function renderItems() {{
        const root = document.querySelector("#dynamic-products");
        root.innerHTML = "";
        for (const item of dynamicItems) {{
          const card = document.createElement("div");
          card.className = "card js-card";
          card.dataset.sku = item.sku;
          card.dataset.price = item.price;
          card.dataset.stock = item.stock;
          card.innerHTML = `
            <h3>${{item.name}}</h3>
            <p class="desc">${{item.desc}}</p>
            <p class="price">$${{item.price.toFixed(2)}}</p>
            <p class="${{item.stock < 8 ? "stock-low" : "stock-ok"}}">Stock: ${{item.stock}}</p>
          `;
          root.appendChild(card);
        }}
        document.querySelector("#render-status").textContent =
          "Rendered " + dynamicItems.length + " JavaScript items.";
      }}
      setTimeout(renderItems, 600);
    </script>
    """
    write_file(
        SITE_DIR / "dynamic.html",
        layout(
            "JavaScript Rendered Catalog",
            """
            <section class="panel">
              <h2>Dynamic content test</h2>
              <p>
                A plain HTTP crawler can download this page, but it will not see the cards below until JavaScript runs.
                PlaywrightCrawler opens a real browser and extracts the rendered DOM.
              </p>
              <p id="render-status">Waiting for JavaScript rendering...</p>
            </section>
            <section id="dynamic-products" class="grid"></section>
            """,
            extra_script=dynamic_script,
        ),
    )
    write_file(
        SITE_DIR / "admin" / "hidden.html",
        layout(
            "Hidden Admin Page",
            """
            <article class="panel">
              <h2>This page should be skipped</h2>
              <p>
                The crawler excludes this admin path to demonstrate control over the crawl scope.
              </p>
            </article>
            """,
        ),
    )


build_demo_site()
print(f"Demo site generated at: {SITE_DIR}")
class QuietHandler(SimpleHTTPRequestHandler):
    """Static file handler that suppresses per-request log lines."""

    def log_message(self, format, *args):
        pass


def start_local_server(directory):
    # Ask the OS for a free port, release it, then bind the real server to it.
    probe = socket.socket()
    probe.bind(("127.0.0.1", 0))
    port = probe.getsockname()[1]
    probe.close()
    handler = partial(QuietHandler, directory=str(directory))
    httpd = ThreadingHTTPServer(("127.0.0.1", port), handler)
    thread = threading.Thread(target=httpd.serve_forever, daemon=True)
    thread.start()
    base_url = f"http://127.0.0.1:{port}"
    time.sleep(0.5)
    return httpd, base_url


def extract_json_ld(soup):
    """Return every JSON-LD block on the page, keeping unparseable ones as raw text."""
    blocks = []
    for script in soup.select('script[type="application/ld+json"]'):
        raw = script.string or script.get_text()
        if not raw:
            continue
        try:
            blocks.append(json.loads(raw))
        except Exception:
            blocks.append({"raw": raw})
    return blocks


def write_json(path, rows):
    path = Path(path)
    path.write_text(json.dumps(rows, ensure_ascii=False, indent=2), encoding="utf-8")


def write_csv(path, rows):
    path = Path(path)
    if not rows:
        path.write_text("", encoding="utf-8")
        return
    # Nested lists and dicts are serialized to JSON strings so each row stays flat.
    flattened = []
    for row in rows:
        flat = {}
        for key, value in row.items():
            if isinstance(value, (list, dict)):
                flat[key] = json.dumps(value, ensure_ascii=False)
            else:
                flat[key] = value
        flattened.append(flat)
    fieldnames = sorted({key for row in flattened for key in row.keys()})
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(flattened)
We expand the demo website by adding documentation pages, a blog article, a JavaScript-rendered catalog page, and an admin page intended to be excluded from crawling. We use these pages to test different crawling scenarios, including static HTML extraction, documentation parsing, blog metadata extraction, dynamic browser rendering, and crawl filtering. We also start a local HTTP server and define utilities to extract JSON-LD content and export crawl results to JSON and CSV.
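The local-server step can be tried in isolation. The sketch below is one possible variant, not the tutorial's exact helper: it binds `ThreadingHTTPServer` to port 0 so the operating system picks a free port directly, instead of probing with a separate socket first. The `serve_directory` name and the temporary site contents are assumptions made for illustration.

```python
import tempfile
import threading
import urllib.request
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer
from pathlib import Path


class QuietHandler(SimpleHTTPRequestHandler):
    def log_message(self, format, *args):
        pass  # keep notebook output free of per-request logs


def serve_directory(directory):
    """Serve a directory on an OS-assigned port and return (server, base_url)."""
    handler = partial(QuietHandler, directory=str(directory))
    httpd = ThreadingHTTPServer(("127.0.0.1", 0), handler)
    threading.Thread(target=httpd.serve_forever, daemon=True).start()
    return httpd, f"http://127.0.0.1:{httpd.server_address[1]}"


# Hypothetical one-page site written to a temporary directory.
site = Path(tempfile.mkdtemp())
(site / "index.html").write_text("<h1>hello crawler</h1>", encoding="utf-8")

httpd, base_url = serve_directory(site)
# An empty ProxyHandler bypasses any proxy configured in the environment.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
with opener.open(f"{base_url}/index.html") as response:
    body = response.read().decode("utf-8")
httpd.shutdown()
httpd.server_close()
print(base_url, body)
```

Because the server socket is already listening when the constructor returns, no sleep is needed before the first request in this variant.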
Static Crawling with BeautifulSoupCrawler and ParselCrawler
async def run_beautifulsoup_crawl(base_url):
    print("\n=== 1) BeautifulSoupCrawler: fast recursive HTTP crawl ===")
    rows = []
    crawler = BeautifulSoupCrawler(
        parser="html.parser",
        max_requests_per_crawl=30,
        max_request_retries=1,
        respect_robots_txt_file=True,
        concurrency_settings=ConcurrencySettings(
            desired_concurrency=4,
            max_concurrency=6,
        ),
    )

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        soup = context.soup
        url = context.request.url
        title = normalize_text(soup.title.get_text(" ", strip=True) if soup.title else "")
        meta_description = ""
        meta_tag = soup.find("meta", attrs={"name": "description"})
        if meta_tag:
            meta_description = normalize_text(meta_tag.get("content", ""))
        out_links = []
        for a in soup.select("a[href]"):
            href = a.get("href")
            label = normalize_text(a.get_text(" ", strip=True), 120)
            out_links.append({"href": href, "label": label})
        page_text = normalize_text(soup.get_text(" ", strip=True), 1000)
        # Classify the page from its URL so type-specific fields can be added below.
        if "/products/" in url:
            page_type = "product"
        elif "/docs/" in url:
            page_type = "documentation"
        elif "/blog/" in url:
            page_type = "blog"
        elif "/dynamic" in url:
            page_type = "dynamic-shell"
        else:
            page_type = "index"
        row = {
            "source": "beautifulsoup-http",
            "url": url,
            "title": title,
            "page_type": page_type,
            "meta_description": meta_description,
            "text_preview": page_text,
            "out_links": out_links,
            "json_ld": extract_json_ld(soup),
            "extracted_at_unix": time.time(),
        }
        if page_type == "product":
            article = soup.select_one("article.product")
            if article:
                price_node = soup.select_one(".price")
                row["product"] = {
                    "sku": article.get("data-sku"),
                    "category": article.get("data-category"),
                    "name": normalize_text(
                        soup.select_one(".product-title").get_text(" ", strip=True)
                        if soup.select_one(".product-title")
                        else ""
                    ),
                    "price": money_to_float(price_node.get("data-price") if price_node else None),
                    "rating": float(article.get("data-rating")) if article.get("data-rating") else None,
                    "stock": int(article.get("data-stock")) if article.get("data-stock") else None,
                    "features": [
                        normalize_text(li.get_text(" ", strip=True))
                        for li in soup.select(".features li")
                    ],
                }
        if page_type == "documentation":
            row["doc"] = {
                "headings": [
                    normalize_text(h.get_text(" ", strip=True))
                    for h in soup.select("h2, h3")
                ],
                "code_blocks": [
                    normalize_text(code.get_text(" ", strip=True))
                    for code in soup.select("pre code")
                ],
            }
        if page_type == "blog":
            row["blog"] = {
                "author": soup.select_one(".blog-post").get("data-author") if soup.select_one(".blog-post") else None,
                "reading_time": soup.select_one(".blog-post").get("data-reading-time") if soup.select_one(".blog-post") else None,
                "tags": [
                    normalize_text(tag.get_text(" ", strip=True))
                    for tag in soup.select(".tag")
                ],
            }
        rows.append(row)
        await context.push_data(row)
        # Stay on the local site; skip the admin area and the JavaScript-only page.
        await context.enqueue_links(
            include=[Glob(f"{base_url}/**")],
            exclude=[
                Glob(f"{base_url}/admin/**"),
                Glob(f"{base_url}/dynamic.html"),
            ],
        )

    await crawler.run([f"{base_url}/index.html"])
    write_json(OUTPUT_DIR / "beautifulsoup_crawl.json", rows)
    write_csv(OUTPUT_DIR / "beautifulsoup_crawl.csv", rows)
    print(f"BeautifulSoup rows extracted: {len(rows)}")
    return rows
async def run_parsel_precision_crawl(base_url):
    print("\n=== 2) ParselCrawler: precise CSS/XPath extraction from product pages ===")
    rows = []
    product_urls = [
        f"{base_url}/products/product-{safe_slug(product['sku'])}.html"
        for product in PRODUCTS
    ]
    crawler = ParselCrawler(
        max_requests_per_crawl=len(product_urls),
        max_request_retries=1,
        concurrency_settings=ConcurrencySettings(
            desired_concurrency=5,
            max_concurrency=8,
        ),
    )

    @crawler.router.default_handler
    async def request_handler(context: ParselCrawlingContext) -> None:
        selector = context.selector
        title = selector.css("title::text").get()
        sku = selector.css("article.product::attr(data-sku)").get()
        category = selector.css("article.product::attr(data-category)").get()
        rating = selector.css("article.product::attr(data-rating)").get()
        stock = selector.css("article.product::attr(data-stock)").get()
        name = selector.css(".product-title::text").get()
        price = selector.css(".price::attr(data-price)").get()
        features = [
            normalize_text(feature)
            for feature in selector.css(".features li::text").getall()
        ]
        row = {
            "source": "parsel-precision",
            "url": context.request.url,
            "title": normalize_text(title),
            "sku": sku,
            "name": normalize_text(name),
            "category": category,
            "price": money_to_float(price),
            "rating": float(rating) if rating else None,
            "stock": int(stock) if stock else None,
            "features": features,
            # Same title read through XPath, to show both selector styles side by side.
            "xpath_title": normalize_text(selector.xpath("//title/text()").get()),
        }
        rows.append(row)
        await context.push_data(row)

    await crawler.run(product_urls)
    write_json(OUTPUT_DIR / "parsel_products.json", rows)
    write_csv(OUTPUT_DIR / "parsel_products.csv", rows)
    print(f"Parsel product rows extracted: {len(rows)}")
    return rows
We implement the static crawling part of the workflow using BeautifulSoupCrawler and ParselCrawler. With BeautifulSoupCrawler, we recursively crawl the local website and extract page titles, metadata, text previews, outgoing links, product details, documentation headings, code blocks, and blog tags. With ParselCrawler, we perform more targeted CSS and XPath extraction from product pages to collect clean product-level fields, including SKU, category, price, rating, stock, and features.
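The page-type decision inside the BeautifulSoup handler can also be written as a small lookup. This is a sketch under one stated assumption: it matches on the parsed URL path prefix, whereas the handler above uses substring checks on the full URL. The names `PAGE_TYPE_PREFIXES` and `classify_page` are illustrative only.

```python
from urllib.parse import urlparse

# Ordered (path prefix, page type) pairs; the first match wins.
PAGE_TYPE_PREFIXES = [
    ("/products/", "product"),
    ("/docs/", "documentation"),
    ("/blog/", "blog"),
    ("/dynamic", "dynamic-shell"),
]


def classify_page(url):
    """Map a URL to a page type based on the start of its path."""
    path = urlparse(url).path
    for prefix, page_type in PAGE_TYPE_PREFIXES:
        if path.startswith(prefix):
            return page_type
    return "index"


for url in [
    "http://127.0.0.1:8000/products/product-crw-101.html",
    "http://127.0.0.1:8000/docs/getting-started.html",
    "http://127.0.0.1:8000/index.html?ref=/docs/",
]:
    print(url, "->", classify_page(url))
```

Matching on the path rather than the whole URL keeps a query string such as `?ref=/docs/` from changing the classification.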
Dynamic Rendering with PlaywrightCrawler and Link Graphs
async def run_playwright_dynamic_crawl(base_url):
    print("\n=== 3) PlaywrightCrawler: browser-rendered JavaScript crawl ===")
    rows = []
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=2,
        max_request_retries=1,
        headless=True,
        browser_type="chromium",
        browser_launch_options={
            "args": ["--no-sandbox", "--disable-dev-shm-usage"],
        },
        goto_options={
            "wait_until": "domcontentloaded",
        },
        concurrency_settings=ConcurrencySettings(
            desired_concurrency=1,
            max_concurrency=2,
        ),
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        # The cards are injected by a timer, so wait until at least one exists.
        await context.page.wait_for_selector(".js-card", timeout=10000)
        cards = await context.page.locator(".js-card").evaluate_all(
            """
            (cards) => cards.map((card) => {
              const h3 = card.querySelector("h3");
              const desc = card.querySelector(".desc");
              const price = card.querySelector(".price");
              return {
                sku: card.dataset.sku,
                name: h3 ? h3.textContent.trim() : null,
                description: desc ? desc.textContent.trim() : null,
                price_text: price ? price.textContent.trim() : null,
                price: Number(card.dataset.price),
                stock: Number(card.dataset.stock),
                rendered_text: card.innerText.trim()
              };
            })
            """
        )
        screenshot_bytes = await context.page.screenshot(full_page=True)
        screenshot_path = SCREENSHOT_DIR / "dynamic_catalog_full_page.png"
        screenshot_path.write_bytes(screenshot_bytes)
        try:
            kvs = await context.get_key_value_store()
            await kvs.set_value(
                key="dynamic-catalog-full-page",
                value=screenshot_bytes,
                content_type="image/png",
            )
        except Exception as exc:
            print("Key-value store screenshot save skipped:", repr(exc))
        for card in cards:
            row = {
                **card,
                "source": "playwright-rendered-js",
                "url": context.request.url,
                "screenshot_path": str(screenshot_path),
                "extracted_at_unix": time.time(),
            }
            rows.append(row)
        await context.push_data(rows)

    try:
        await crawler.run([f"{base_url}/dynamic.html"])
    except Exception as exc:
        print("Playwright section failed gracefully.")
        print("Reason:", repr(exc))
    write_json(OUTPUT_DIR / "playwright_dynamic.json", rows)
    write_csv(OUTPUT_DIR / "playwright_dynamic.csv", rows)
    print(f"Playwright dynamic rows extracted: {len(rows)}")
    return rows
def flatten_products(rows):
    """Normalize product records from all three crawlers into one flat schema."""
    products = []
    for row in rows:
        if row.get("page_type") == "product" and isinstance(row.get("product"), dict):
            product = row["product"]
            products.append(
                {
                    "source": row.get("source"),
                    "url": row.get("url"),
                    "sku": product.get("sku"),
                    "name": product.get("name"),
                    "category": product.get("category"),
                    "price": product.get("price"),
                    "rating": product.get("rating"),
                    "stock": product.get("stock"),
                    "features": "; ".join(product.get("features", [])),
                }
            )
        elif row.get("source") == "parsel-precision":
            products.append(
                {
                    "source": row.get("source"),
                    "url": row.get("url"),
                    "sku": row.get("sku"),
                    "name": row.get("name"),
                    "category": row.get("category"),
                    "price": row.get("price"),
                    "rating": row.get("rating"),
                    "stock": row.get("stock"),
                    "features": "; ".join(row.get("features", [])),
                }
            )
        elif row.get("source") == "playwright-rendered-js":
            products.append(
                {
                    "source": row.get("source"),
                    "url": row.get("url"),
                    "sku": row.get("sku"),
                    "name": row.get("name"),
                    "category": "dynamic-js",
                    "price": row.get("price") or money_to_float(row.get("price_text")),
                    "rating": None,
                    "stock": row.get("stock"),
                    "features": row.get("description"),
                }
            )
    return products


def absolute_url(base_url, href):
    if not href:
        return None
    if href.startswith("http://") or href.startswith("https://"):
        return href
    if href.startswith("/"):
        return base_url + href
    return base_url + "/" + href


def build_link_graph(base_url, rows):
    """Build a directed graph with one edge per crawled page -> linked page."""
    graph = nx.DiGraph()
    for row in rows:
        src = row.get("url")
        if not src:
            continue
        graph.add_node(
            src,
            title=row.get("title", ""),
            page_type=row.get("page_type", ""),
        )
        for link in row.get("out_links", []) or []:
            dst = absolute_url(base_url, link.get("href"))
            if not dst:
                continue
            if "/admin/" in dst:
                continue
            graph.add_node(dst)
            graph.add_edge(src, dst, label=link.get("label", ""))
    return graph
We handle dynamic content using PlaywrightCrawler, which opens the JavaScript-rendered page in a headless Chromium browser. We wait for client-side product cards to appear, extract their rendered fields, capture a full-page screenshot, and save the browser-based results for later analysis. We then define helper functions to normalize product records and build a directed link graph from the internal links discovered during crawling.
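The link-graph idea can be illustrated without networkx. The sketch below is a simplified stand-in, not the tutorial's `build_link_graph`: it resolves each href against the page it was found on with `urllib.parse.urljoin` (the tutorial's `absolute_url` prepends the base URL instead) and counts in-degrees with a `Counter`. The sample `rows` and the `build_edges` helper are hypothetical.

```python
from collections import Counter
from urllib.parse import urljoin


def build_edges(rows, skip_fragment="/admin/"):
    """Return (source, destination) pairs, resolving hrefs relative to each page."""
    edges = []
    for row in rows:
        src = row["url"]
        for link in row.get("out_links", []):
            href = link.get("href")
            if not href:
                continue
            dst = urljoin(src, href)
            if skip_fragment in dst:
                continue  # mirror the crawl scope: admin pages are excluded
            edges.append((src, dst))
    return edges


# Hypothetical crawl rows shaped like the BeautifulSoup output.
rows = [
    {
        "url": "http://127.0.0.1:8000/index.html",
        "out_links": [
            {"href": "/docs/getting-started.html"},
            {"href": "/admin/hidden.html"},
        ],
    },
    {
        "url": "http://127.0.0.1:8000/docs/getting-started.html",
        "out_links": [
            {"href": "advanced-routing.html"},
            {"href": "/index.html"},
        ],
    },
]

edges = build_edges(rows)
in_degree = Counter(dst for _, dst in edges)
for src, dst in edges:
    print(src, "->", dst)
```

Resolving against the source page matters for relative links such as `advanced-routing.html`, which should land under `/docs/` rather than at the site root.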
Building AI-Ready Outputs and Running the Pipeline
def make_rag_chunks(rows, max_chars=700):
    """Split each row's text into sentence-aligned chunks of at most max_chars."""
    chunks = []
    for row in rows:
        text = (
            row.get("text_preview")
            or row.get("rendered_text")
            or row.get("description")
            or ""
        )
        text = normalize_text(text)
        if not text:
            continue
        sentences = re.split(r"(?<=[.!?])\s+", text)
        current = ""
        for sentence in sentences:
            if len(current) + len(sentence) + 1 <= max_chars:
                current = (current + " " + sentence).strip()
            else:
                if current:
                    chunks.append(
                        {
                            "chunk_id": hashlib.sha1(
                                (row.get("url", "") + current).encode()
                            ).hexdigest()[:12],
                            "url": row.get("url"),
                            "source": row.get("source"),
                            "page_type": row.get("page_type"),
                            "title": row.get("title") or row.get("name"),
                            "text": current,
                        }
                    )
                current = sentence
        # Flush whatever remains after the last sentence.
        if current:
            chunks.append(
                {
                    "chunk_id": hashlib.sha1(
                        (row.get("url", "") + current).encode()
                    ).hexdigest()[:12],
                    "url": row.get("url"),
                    "source": row.get("source"),
                    "page_type": row.get("page_type"),
                    "title": row.get("title") or row.get("name"),
                    "text": current,
                }
            )
    return chunks
def analyze_outputs(base_url, bs4_rows, parsel_rows, playwright_rows):
    all_rows = bs4_rows + parsel_rows + playwright_rows
    products = flatten_products(all_rows)
    crawl_df = pd.DataFrame(all_rows)
    product_df = pd.DataFrame(products)
    if not product_df.empty:
        # Coerce numeric columns; values that cannot be parsed become NaN.
        product_df["price"] = pd.to_numeric(product_df["price"], errors="coerce")
        product_df["stock"] = pd.to_numeric(product_df["stock"], errors="coerce")
        product_df["rating"] = pd.to_numeric(product_df["rating"], errors="coerce")
        product_df["inventory_value"] = product_df["price"] * product_df["stock"]
    # Build the link graph from the BeautifulSoup crawl and export it as GraphML.
    graph = build_link_graph(base_url, bs4_rows)
    graph_path = OUTPUT_DIR / "site_link_graph.graphml"
    if graph.number_of_nodes() > 0:
        nx.write_graphml(graph, graph_path)
    # Write RAG chunks as JSONL: one JSON object per line.
    chunks = make_rag_chunks(all_rows)
    rag_path = OUTPUT_DIR / "rag_chunks.jsonl"
    with rag_path.open("w", encoding="utf-8") as f:
        for chunk in chunks:
            f.write(json.dumps(chunk, ensure_ascii=False) + "\n")
    crawl_json_path = OUTPUT_DIR / "combined_crawl_results.json"
    crawl_json_path.write_text(
        json.dumps(all_rows, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    product_csv_path = OUTPUT_DIR / "normalized_product_catalog.csv"
    if not product_df.empty:
        product_df.to_csv(product_csv_path, index=False)
    # Plot prices only when at least one product has a numeric price.
    price_plot_path = OUTPUT_DIR / "product_price_chart.png"
    if not product_df.empty and product_df["price"].notna().any():
        plot_df = product_df.dropna(subset=["price"]).copy()
        plot_df["label"] = plot_df["sku"].fillna("unknown") + "\n" + plot_df["source"].fillna("")
        ax = plot_df.plot(
            kind="bar",
            x="label",
            y="price",
            legend=False,
            figsize=(11, 5),
            title="Extracted Product Prices by Source",
        )
        ax.set_xlabel("Product / extraction source")
        ax.set_ylabel("Price")
        plt.xticks(rotation=35, ha="right")
        plt.tight_layout()
        plt.savefig(price_plot_path, dpi=160)
        plt.show()
    graph_stats = {
        "nodes": graph.number_of_nodes(),
        "edges": graph.number_of_edges(),
        "weakly_connected_components": (
            nx.number_weakly_connected_components(graph)
            if graph.number_of_nodes()
            else 0
        ),
    }
    if graph.number_of_nodes() > 0:
        # Record the five most linked-to and most linking pages.
        in_degrees = dict(graph.in_degree())
        out_degrees = dict(graph.out_degree())
        graph_stats["top_in_degree"] = sorted(
            in_degrees.items(),
            key=lambda x: x[1],
            reverse=True,
        )[:5]
        graph_stats["top_out_degree"] = sorted(
            out_degrees.items(),
            key=lambda x: x[1],
            reverse=True,
        )[:5]
    summary = {
        "base_url": base_url,
        "rows_total": len(all_rows),
        "beautifulsoup_rows": len(bs4_rows),
        "parsel_rows": len(parsel_rows),
        "playwright_rows": len(playwright_rows),
        "products_total": len(product_df),
        "rag_chunks_total": len(chunks),
        "graph": graph_stats,
        "outputs": {
            "beautifulsoup_json": str(OUTPUT_DIR / "beautifulsoup_crawl.json"),
            "beautifulsoup_csv": str(OUTPUT_DIR / "beautifulsoup_crawl.csv"),
            "parsel_json": str(OUTPUT_DIR / "parsel_products.json"),
            "parsel_csv": str(OUTPUT_DIR / "parsel_products.csv"),
            "playwright_json": str(OUTPUT_DIR / "playwright_dynamic.json"),
            "playwright_csv": str(OUTPUT_DIR / "playwright_dynamic.csv"),
            "combined_json": str(crawl_json_path),
            "product_csv": str(product_csv_path) if product_csv_path.exists() else None,
            "rag_jsonl": str(rag_path),
            "graphml": str(graph_path) if graph_path.exists() else None,
            "price_plot": str(price_plot_path) if price_plot_path.exists() else None,
            "screenshots_dir": str(SCREENSHOT_DIR),
        },
    }
    # Write a Markdown run summary next to the other artifacts.
    summary_path = OUTPUT_DIR / "run_summary.md"
    summary_path.write_text(
        "# Crawlee Python Advanced Tutorial Run Summary\n\n"
        f"- Local demo website: `{base_url}`\n"
        f"- Total extracted rows: `{summary['rows_total']}`\n"
        f"- BeautifulSoup rows: `{summary['beautifulsoup_rows']}`\n"
        f"- Parsel rows: `{summary['parsel_rows']}`\n"
        f"- Playwright rows: `{summary['playwright_rows']}`\n"
        f"- Normalized products: `{summary['products_total']}`\n"
        f"- RAG chunks: `{summary['rag_chunks_total']}`\n"
        f"- Link graph nodes: `{graph_stats['nodes']}`\n"
        f"- Link graph edges: `{graph_stats['edges']}`\n\n"
        "## Output files\n\n"
        + "\n".join(f"- `{k}`: `{v}`" for k, v in summary["outputs"].items())
        + "\n",
        encoding="utf-8",
    )
    print("\n=== 4) Analysis summary ===")
    print(json.dumps(summary, indent=2, ensure_ascii=False))
    # Rich previews are best-effort: outside a notebook this block is skipped.
    try:
        from IPython.display import display, Markdown, Image as IPImage
        display(Markdown("## Crawlee crawl preview"))
        if not crawl_df.empty:
            preview_cols = [
                col for col in ["source", "page_type", "title", "url"]
                if col in crawl_df.columns
            ]
            display(crawl_df[preview_cols].head(12))
        display(Markdown("## Normalized product catalog"))
        if not product_df.empty:
            display(product_df.head(20))
        if price_plot_path.exists():
            display(Markdown("## Product price chart"))
            display(IPImage(filename=str(price_plot_path)))
        screenshot_path = SCREENSHOT_DIR / "dynamic_catalog_full_page.png"
        if screenshot_path.exists():
            display(Markdown("## Playwright screenshot of JavaScript-rendered page"))
            display(IPImage(filename=str(screenshot_path)))
        display(Markdown(f"## Output directory\n`{OUTPUT_DIR}`"))
    except Exception as exc:
        print("Notebook display skipped:", repr(exc))
    return summary
async def main():
    httpd, base_url = start_local_server(SITE_DIR)
    print(f"\nLocal demo website is running at: {base_url}/index.html")
    try:
        # Run the three crawlers in sequence, then post-process their rows.
        bs4_rows = await run_beautifulsoup_crawl(base_url)
        parsel_rows = await run_parsel_precision_crawl(base_url)
        playwright_rows = await run_playwright_dynamic_crawl(base_url)
        summary = analyze_outputs(base_url, bs4_rows, parsel_rows, playwright_rows)
        return summary
    finally:
        # Always stop the local server, even if a crawl fails.
        httpd.shutdown()
        print("\nLocal demo server shut down.")
loop = asyncio.get_event_loop()
summary = loop.run_until_complete(main())
print("\nTutorial complete.")
print(f"All outputs are in: {OUTPUT_DIR}")
print("Key files:")
for file_path in sorted(OUTPUT_DIR.rglob("*")):
    if file_path.is_file():
        print(" -", file_path)
We process the extracted crawl data into analysis-ready and AI-ready outputs. We create RAG-style JSONL chunks, combine all crawl results, build a normalized product catalog, generate a GraphML link graph, and visualize product prices with Matplotlib. Finally, we run the complete pipeline end-to-end, display previews in the notebook, save all generated artifacts, and print the final output file paths.
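To make the JSONL chunk format concrete, here is a minimal standard-library sketch of chunking page text and round-tripping it through a JSONL file. It is not the tutorial's `make_rag_chunks`: the helper names `chunk_words`, `write_jsonl`, and `read_jsonl`, the `chunk_id` and `url` fields, the sample URL, and the window and overlap sizes are all assumptions chosen for illustration.

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory


def chunk_words(text, size=40, overlap=10):
    """Split text into overlapping word windows (sizes are illustrative)."""
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + size]
        if window:
            chunks.append(" ".join(window))
        # Stop once the current window already reaches the end of the text.
        if start + size >= len(words):
            break
    return chunks


def write_jsonl(rows, path):
    """Write one JSON object per line, each terminated by a newline."""
    with Path(path).open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with Path(path).open("r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical crawl row; real rows would come from the crawlers above.
page = {"url": "http://127.0.0.1:8000/docs/intro.html", "text": "word " * 100}
records = [
    {"chunk_id": f"{page['url']}#{i}", "url": page["url"], "text": chunk}
    for i, chunk in enumerate(chunk_words(page["text"]))
]

with TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "rag_chunks_demo.jsonl"
    write_jsonl(records, out_path)
    loaded = read_jsonl(out_path)
```

Because each line is an independent JSON object, a downstream embedding or indexing job can stream the file line by line instead of loading the whole export into memory.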
Conclusion
In conclusion, we have a complete Crawlee-based pipeline for crawling and data engineering that converts a small website into structured, reusable datasets. We used crawl scoping, robots.txt handling, concurrency settings, link enqueuing, browser rendering, key-value storage, and dataset exports to simulate patterns used in production web crawling systems. We normalized the extracted product data, saved the crawl outputs as JSON and CSV, created GraphML link graphs with NetworkX, generated JSONL chunks for retrieval-augmented generation workflows, and visualized the extracted product prices with Matplotlib.
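As a small illustration of the robots.txt idea mentioned above, the following sketch filters candidate URLs with Python's standard `urllib.robotparser`. It is independent of how Crawlee applies robots rules internally; the rules, the `demo-crawler` user agent, the `allowed_urls` helper, and the URLs are hypothetical and are not taken from the demo site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules; the tutorial's generated robots.txt may differ.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def allowed_urls(robots_text, urls, user_agent="demo-crawler"):
    """Return only the URLs that the given robots.txt text permits."""
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network request is needed.
    parser.parse(robots_text.splitlines())
    return [u for u in urls if parser.can_fetch(user_agent, u)]


candidates = [
    "http://127.0.0.1:8000/index.html",
    "http://127.0.0.1:8000/private/admin.html",
    "http://127.0.0.1:8000/products/p1.html",
]
kept = allowed_urls(ROBOTS_TXT, candidates)
```

Checking candidates before they are enqueued keeps disallowed pages out of the request queue entirely, which is usually simpler than discarding them after a fetch.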
The post Crawlee for Python: Build a Web Crawling Pipeline with Robots Handling, Link Graphs, and RAG Chunk Export appeared first on MarkTechPost.
