
A Coding Implementation of Crawl4AI for Web Crawling, Markdown Generation, JavaScript Execution, and LLM-Based Structured Extraction


In this tutorial, we build a complete, practical Crawl4AI workflow and see how modern web crawling goes far beyond simply downloading page HTML. We set up the full environment, configure browser behavior, and work through essential capabilities such as basic crawling, markdown generation, structured CSS-based extraction, JavaScript execution, session handling, screenshots, link analysis, concurrent crawling, and deep multi-page exploration. We also examine how Crawl4AI can be extended with LLM-based extraction to transform raw web content into structured, usable data. Throughout the tutorial, we focus on hands-on implementation to understand the key features of Crawl4AI v0.8.x and learn to apply them to realistic data-extraction and web-automation tasks.

import subprocess
import sys


print("📦 Installing system dependencies...")
subprocess.run(['apt-get', 'update', '-qq'], capture_output=True)
subprocess.run(['apt-get', 'install', '-y', '-qq',
               'libnss3', 'libnspr4', 'libatk1.0-0', 'libatk-bridge2.0-0',
               'libcups2', 'libdrm2', 'libxkbcommon0', 'libxcomposite1',
               'libxdamage1', 'libxfixes3', 'libxrandr2', 'libgbm1',
               'libasound2', 'libpango-1.0-0', 'libcairo2'], capture_output=True)
print("✅ System dependencies installed!")


print("\n📦 Installing Python packages...")
subprocess.run([sys.executable, '-m', 'pip', 'install', '-U', 'crawl4ai', 'nest_asyncio', 'pydantic', '-q'])
print("✅ Python packages installed!")


print("\n📦 Installing Playwright browsers (this may take a minute)...")
subprocess.run([sys.executable, '-m', 'playwright', 'install', 'chromium'], capture_output=True)
subprocess.run([sys.executable, '-m', 'playwright', 'install-deps', 'chromium'], capture_output=True)
print("✅ Playwright browsers installed!")


import nest_asyncio
nest_asyncio.apply()


import asyncio
import json
from typing import List, Optional
from pydantic import BaseModel, Field


print("\n" + "="*60)
print("✅ INSTALLATION COMPLETE! Ready to crawl!")
print("="*60)


print("\n" + "="*60)
print("📖 PART 2: BASIC CRAWLING")
print("="*60)


from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode


async def basic_crawl():
    """The simplest possible crawl - fetch a webpage and get markdown."""
    print("\n🔍 Running basic crawl on example.com...")

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")

        print(f"\n✅ Crawl successful: {result.success}")
        print(f"📄 Title: {result.metadata.get('title', 'N/A')}")
        print(f"📝 Markdown length: {len(result.markdown.raw_markdown)} characters")
        print("\n--- First 500 chars of markdown ---")
        print(result.markdown.raw_markdown[:500])

    return result


result = asyncio.run(basic_crawl())


print("\n" + "="*60)
print("⚙ PART 3: CONFIGURED CRAWLING")
print("="*60)


async def configured_crawl():
    """Crawling with custom browser and crawler configurations."""
    print("\n🔧 Running configured crawl with custom settings...")

    browser_config = BrowserConfig(
        headless=True,
        verbose=True,
        viewport_width=1920,
        viewport_height=1080,
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
    )

    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        word_count_threshold=10,
        page_timeout=30000,
        wait_until="networkidle",
        verbose=True
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://httpbin.org/html",
            config=run_config
        )

        print(f"\n✅ Success: {result.success}")
        print(f"📊 Status code: {result.status_code}")
        print("\n--- Content Preview ---")
        print(result.markdown.raw_markdown[:400])

    return result


result = asyncio.run(configured_crawl())


print("\n" + "="*60)
print("📝 PART 4: MARKDOWN GENERATION")
print("="*60)


from crawl4ai.content_filter_strategy import PruningContentFilter, BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator


async def markdown_generation_demo():
    """Demonstrates raw vs fit markdown with content filtering."""
    print("\n🎯 Demonstrating markdown generation strategies...")

    browser_config = BrowserConfig(headless=True, verbose=False)

    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        markdown_generator=DefaultMarkdownGenerator(
            content_filter=PruningContentFilter(
                threshold=0.4,
                threshold_type="fixed",
                min_word_threshold=20
            )
        )
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://en.wikipedia.org/wiki/Web_scraping",
            config=run_config
        )

        raw_len = len(result.markdown.raw_markdown)
        fit_len = len(result.markdown.fit_markdown) if result.markdown.fit_markdown else 0

        print("\n📊 Markdown Comparison:")
        print(f"   Raw Markdown:  {raw_len:,} characters")
        print(f"   Fit Markdown:  {fit_len:,} characters")
        print(f"   Reduction:     {((raw_len - fit_len) / raw_len * 100):.1f}%")

        print("\n--- Fit Markdown Preview (first 600 chars) ---")
        print(result.markdown.fit_markdown[:600] if result.markdown.fit_markdown else "N/A")

    return result


result = asyncio.run(markdown_generation_demo())

We prepare the complete Google Colab environment required to run Crawl4AI smoothly, including system packages, Python dependencies, and the Playwright browser setup. We initialize the async-friendly notebook workflow with nest_asyncio, import the core libraries, and confirm that the environment is ready for crawling tasks. We then begin with foundational examples: a simple crawl, followed by a more configurable crawl that demonstrates how browser settings and runtime options affect page retrieval.
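Because Colab (like Jupyter) already runs an event loop, `asyncio.run` would normally fail inside a cell; `nest_asyncio.apply()` patches the loop so the tutorial's `result = asyncio.run(...)` pattern works. A minimal stdlib-only sketch of that pattern, using a stand-in coroutine instead of a real crawl:

```python
import asyncio

async def fetch_stub(url: str) -> str:
    # Stand-in for crawler.arun(): pretend to fetch a page and return markdown.
    await asyncio.sleep(0)
    return f"# Markdown for {url}"

# In a plain script there is no running loop, so asyncio.run works directly;
# inside a notebook a loop is already running, which is why the tutorial
# calls nest_asyncio.apply() first.
markdown = asyncio.run(fetch_stub("https://example.com"))
print(markdown)  # → # Markdown for https://example.com
```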

print("\n" + "="*60)
print("🔎 PART 5: BM25 QUERY-BASED FILTERING")
print("="*60)


async def bm25_filtering_demo():
    """Use the BM25 algorithm to extract content relevant to a specific query."""
    print("\n🎯 Extracting content relevant to a specific query...")

    query = "legal issues privacy data protection"

    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        markdown_generator=DefaultMarkdownGenerator(
            content_filter=BM25ContentFilter(
                user_query=query,
                bm25_threshold=1.2
            )
        )
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://en.wikipedia.org/wiki/Web_scraping",
            config=run_config
        )

        print(f"\n📝 Query: '{query}'")
        print(f"📊 Fit markdown length: {len(result.markdown.fit_markdown or '')} chars")
        print("\n--- Query-Relevant Content Preview ---")
        print(result.markdown.fit_markdown[:800] if result.markdown.fit_markdown else "No relevant content found")

    return result


result = asyncio.run(bm25_filtering_demo())


print("\n" + "="*60)
print("🏗 PART 6: CSS-BASED EXTRACTION (No LLM)")
print("="*60)


from crawl4ai import JsonCssExtractionStrategy


async def css_extraction_demo():
    """Extract structured data using CSS selectors - fast and reliable."""
    print("\n🔧 Extracting data using CSS selectors...")

    schema = {
        "name": "Wikipedia Headings",
        "baseSelector": "div.mw-parser-output h2",
        "fields": [
            {
                "name": "heading_text",
                "selector": "span.mw-headline",
                "type": "text"
            },
            {
                "name": "heading_id",
                "selector": "span.mw-headline",
                "type": "attribute",
                "attribute": "id"
            }
        ]
    }

    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=False)

    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=extraction_strategy
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://en.wikipedia.org/wiki/Python_(programming_language)",
            config=run_config
        )

        if result.extracted_content:
            data = json.loads(result.extracted_content)
            print(f"\n✅ Extracted {len(data)} section headings")
            print("\n--- Extracted Headings ---")
            for item in data[:10]:
                heading = item.get('heading_text', 'N/A')
                heading_id = item.get('heading_id', 'N/A')
                if heading:
                    print(f"  • {heading} (#{heading_id})")
        else:
            print("❌ No data extracted")

    return result


result = asyncio.run(css_extraction_demo())


print("\n" + "="*60)
print("🛒 PART 7: ADVANCED CSS EXTRACTION - Hacker News")
print("="*60)


async def advanced_css_extraction():
    """Extract stories from Hacker News with nested selectors."""
    print("\n🛍 Extracting stories from Hacker News...")

    schema = {
        "name": "Hacker News Stories",
        "baseSelector": "tr.athing",
        "fields": [
            {
                "name": "rank",
                "selector": "span.rank",
                "type": "text"
            },
            {
                "name": "title",
                "selector": "span.titleline > a",
                "type": "text"
            },
            {
                "name": "url",
                "selector": "span.titleline > a",
                "type": "attribute",
                "attribute": "href"
            },
            {
                "name": "site",
                "selector": "span.sitestr",
                "type": "text"
            }
        ]
    }

    extraction_strategy = JsonCssExtractionStrategy(schema)

    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=extraction_strategy
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://news.ycombinator.com",
            config=run_config
        )

        if result.extracted_content:
            stories = json.loads(result.extracted_content)
            print(f"\n✅ Extracted {len(stories)} stories from Hacker News")
            print("\n--- Top 10 Stories ---")
            for story in stories[:10]:
                rank = story.get('rank', '?').strip('.') if story.get('rank') else '?'
                title = story.get('title', 'N/A')[:55]
                site = story.get('site', 'N/A')
                print(f"  #{rank:<3} {title:<55} ({site})")

    return result


result = asyncio.run(advanced_css_extraction())

We focus on improving the quality and relevance of extracted content by exploring markdown generation and query-aware filtering. We compare raw markdown with fit markdown to see how pruning reduces noise, and we use BM25-based filtering to keep only the parts of a page that align with a specific query. We then move into CSS-based extraction, where we define a structured schema and use selectors to pull clean heading data from a Wikipedia page without relying on an LLM.
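To build intuition for what BM25ContentFilter is doing under the hood, here is a simplified, stdlib-only sketch of BM25 scoring over text chunks. This illustrates the ranking idea only; it is not Crawl4AI's actual implementation, and the chunking/tokenization here is deliberately naive:

```python
import math
from collections import Counter

def bm25_scores(chunks, query, k1=1.2, b=0.75):
    """Score each text chunk against the query with a textbook BM25 formula."""
    docs = [c.lower().split() for c in chunks]
    avgdl = sum(len(d) for d in docs) / len(docs)
    n = len(docs)
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for d in docs if term in d)          # document frequency
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1)  # smoothed IDF
            freq = tf[term]
            score += idf * freq * (k1 + 1) / (freq + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores

chunks = [
    "Web scraping raises legal issues around privacy and data protection.",
    "The history of the internet began with ARPANET.",
]
scores = bm25_scores(chunks, "legal issues privacy data protection")
# The privacy-related chunk should outscore the unrelated one.
print(scores[0] > scores[1])  # → True
```

A filter like BM25ContentFilter applies this kind of score to page segments and keeps those above a threshold, which is what produces the shorter fit markdown.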

print("\n" + "="*60)
print("⚡ PART 8: JAVASCRIPT EXECUTION")
print("="*60)


async def javascript_execution_demo():
    """Execute JavaScript on pages before extraction."""
    print("\n🎭 Executing JavaScript before crawling...")

    js_code = """
    // Scroll down to trigger lazy loading
    window.scrollTo(0, document.body.scrollHeight);

    // Wait for content to load
    await new Promise(r => setTimeout(r, 1000));

    // Scroll back up
    window.scrollTo(0, 0);

    // Add a marker to verify JS ran
    document.body.setAttribute('data-crawl4ai', 'executed');
    """

    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        js_code=[js_code],
        wait_for="css:body",
        delay_before_return_html=1.0
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://httpbin.org/html",
            config=run_config
        )

        print("\n✅ Page crawled with JS execution")
        print(f"📊 Status: {result.status_code}")
        print(f"📝 Content length: {len(result.markdown.raw_markdown)} chars")

    return result


result = asyncio.run(javascript_execution_demo())


print("\n" + "="*60)
print("🤖 PART 9: LLM-BASED EXTRACTION")
print("="*60)


from crawl4ai import LLMExtractionStrategy, LLMConfig


class Article(BaseModel):
    title: str = Field(description="The article title")
    summary: str = Field(description="A brief summary")
    topics: List[str] = Field(description="Main topics covered")


async def llm_extraction_demo():
    """Use an LLM to intelligently extract and structure data."""
    print("\n🤖 LLM-based extraction setup...")

    import os
    api_key = os.getenv('OPENAI_API_KEY')

    if not api_key:
        print("\n⚠ No OPENAI_API_KEY found. Showing setup code only.")
        print("\nTo enable LLM extraction, run:")
        print("   import os")
        print("   os.environ['OPENAI_API_KEY'] = 'sk-your-key-here'")
        print("\n--- Example Code ---")
        example_code = '''
from crawl4ai import LLMExtractionStrategy, LLMConfig
from pydantic import BaseModel, Field


class Product(BaseModel):
    name: str = Field(description="Product name")
    price: str = Field(description="Product price")


llm_strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(
        provider="openai/gpt-4o-mini",  # or "ollama/llama3"
        api_token=os.getenv('OPENAI_API_KEY')
    ),
    schema=Product.model_json_schema(),
    extraction_type="schema",
    instruction="Extract all products with prices."
)


run_config = CrawlerRunConfig(
    extraction_strategy=llm_strategy,
    cache_mode=CacheMode.BYPASS
)


async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="https://example.com", config=run_config)
    products = json.loads(result.extracted_content)
'''
        print(example_code)
        return None

    llm_strategy = LLMExtractionStrategy(
        llm_config=LLMConfig(
            provider="openai/gpt-4o-mini",
            api_token=api_key
        ),
        schema=Article.model_json_schema(),
        extraction_type="schema",
        instruction="Extract article titles and summaries."
    )

    run_config = CrawlerRunConfig(
        extraction_strategy=llm_strategy,
        cache_mode=CacheMode.BYPASS
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://news.ycombinator.com",
            config=run_config
        )

        if result.extracted_content:
            data = json.loads(result.extracted_content)
            print("\n✅ LLM extracted:")
            print(json.dumps(data, indent=2)[:1000])

    return result


result = asyncio.run(llm_extraction_demo())

We continue structured extraction by applying nested CSS selectors to collect ranked story information from Hacker News in a clean, JSON-like format. We then demonstrate JavaScript execution before extraction, which helps us interact with dynamic pages by scrolling, waiting for content, and modifying the DOM before processing. Finally, we introduce LLM-based extraction, define a schema with Pydantic, and show how Crawl4AI can convert unstructured web content into structured outputs using a language model.
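Because LLM output can drift from the requested schema, it is worth sanity-checking the extracted JSON before using it downstream. A stdlib-only sketch of such a check, where the `REQUIRED` mapping and `validate_records` helper are hypothetical additions mirroring the Article fields, not part of Crawl4AI:

```python
import json

# Hypothetical lightweight check: keep only records that carry the
# fields the Article schema promised, with the expected Python types.
REQUIRED = {"title": str, "summary": str, "topics": list}

def validate_records(raw_json: str) -> list:
    records = json.loads(raw_json)
    good = []
    for rec in records:
        if all(isinstance(rec.get(k), t) for k, t in REQUIRED.items()):
            good.append(rec)
    return good

sample = json.dumps([
    {"title": "Show HN: ...", "summary": "A tool.", "topics": ["tools"]},
    {"title": "Broken", "summary": None, "topics": "oops"},  # dropped
])
print(len(validate_records(sample)))  # → 1
```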

print("\n" + "="*60)
print("🕸 PART 10: DEEP CRAWLING")
print("="*60)


from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
from crawl4ai.deep_crawling.filters import FilterChain, URLPatternFilter, DomainFilter


async def deep_crawl_demo():
    """Crawl multiple pages starting from a seed URL using BFS."""
    print("\n🕷 Starting deep crawl with BFS strategy...")

    filter_chain = FilterChain([
        DomainFilter(
            allowed_domains=["docs.crawl4ai.com"],
            blocked_domains=[]
        ),
        URLPatternFilter(
            patterns=["*quickstart*", "*installation*", "*examples*"]
        )
    ])

    deep_crawl_strategy = BFSDeepCrawlStrategy(
        max_depth=2,
        max_pages=5,
        filter_chain=filter_chain,
        include_external=False
    )

    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        deep_crawl_strategy=deep_crawl_strategy
    )

    pages_crawled = []

    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun(
            url="https://docs.crawl4ai.com/",
            config=run_config
        )

        if isinstance(results, list):
            for result in results:
                pages_crawled.append(result.url)
                print(f"  ✅ Crawled: {result.url}")
                print(f"     📄 Content: {len(result.markdown.raw_markdown)} chars")
        else:
            pages_crawled.append(results.url)
            print(f"  ✅ Crawled: {results.url}")
            print(f"     📄 Content: {len(results.markdown.raw_markdown)} chars")

    print(f"\n📊 Total pages crawled: {len(pages_crawled)}")
    return pages_crawled


pages = asyncio.run(deep_crawl_demo())


print("\n" + "="*60)
print("🚀 PART 11: MULTI-URL CONCURRENT CRAWLING")
print("="*60)


async def multi_url_crawl():
    """Crawl multiple URLs concurrently for maximum efficiency."""
    print("\n⚡ Crawling multiple URLs concurrently...")

    urls = [
        "https://httpbin.org/html",
        "https://httpbin.org/robots.txt",
        "https://httpbin.org/json",
        "https://example.com",
        "https://httpbin.org/headers"
    ]

    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        verbose=False
    )

    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(
            urls=urls,
            config=run_config
        )

        print("\n📊 Results Summary:")
        print(f"{'URL':<40} {'Status':<10} {'Content':<15}")
        print("-" * 65)

        for result in results:
            url_short = result.url[:38] + ".." if len(result.url) > 40 else result.url
            status = "✅" if result.success else "❌"
            content_len = f"{len(result.markdown.raw_markdown):,} chars" if result.success else "N/A"
            print(f"{url_short:<40} {status:<10} {content_len:<15}")

    return results


results = asyncio.run(multi_url_crawl())


print("\n" + "="*60)
print("📸 PART 12: SCREENSHOTS & MEDIA")
print("="*60)


async def screenshot_demo():
    """Capture screenshots and extract media from pages."""
    print("\n📷 Capturing screenshot and extracting media...")

    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        screenshot=True,
        pdf=False,
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://en.wikipedia.org/wiki/Web_scraping",
            config=run_config
        )

        print("\n✅ Crawl complete!")
        print(f"📸 Screenshot captured: {result.screenshot is not None}")

        if result.screenshot:
            print(f"   Screenshot size: {len(result.screenshot)} bytes (base64)")

        if result.media and 'images' in result.media:
            images = result.media['images']
            print(f"\n🖼 Found {len(images)} images:")
            for img in images[:5]:
                print(f"   • {img.get('src', 'N/A')[:60]}...")

    return result


result = asyncio.run(screenshot_demo())

We expand from single-page crawling to deeper and broader workflows by introducing BFS-based deep crawling across multiple related pages. We configure a filter chain to control which domains and URL patterns are allowed, making the crawl targeted and efficient rather than uncontrolled. We also demonstrate concurrent multi-URL crawling and screenshot/media extraction, showing how Crawl4AI can scale across multiple pages while also collecting visual and embedded content.

print("\n" + "="*60)
print("🔗 PART 13: LINK EXTRACTION")
print("="*60)


async def link_extraction_demo():
    """Extract and analyze all links from a page."""
    print("\n🔗 Extracting and analyzing links...")

    run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://docs.crawl4ai.com/",
            config=run_config
        )

        internal_links = result.links.get('internal', [])
        external_links = result.links.get('external', [])

        print("\n📊 Link Analysis:")
        print(f"   Internal links: {len(internal_links)}")
        print(f"   External links: {len(external_links)}")

        print("\n--- Sample Internal Links (first 5) ---")
        for link in internal_links[:5]:
            print(f"   • {link.get('href', 'N/A')[:60]}")

        print("\n--- Sample External Links (first 5) ---")
        for link in external_links[:5]:
            print(f"   • {link.get('href', 'N/A')[:60]}")

    return result


result = asyncio.run(link_extraction_demo())


print("\n" + "="*60)
print("🎯 PART 14: CONTENT SELECTION")
print("="*60)


async def content_selection_demo():
    """Target specific content using CSS selectors."""
    print("\n🎯 Targeting specific content with CSS selectors...")

    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        css_selector="article, main, .content, #content, #mw-content-text",
        excluded_tags=["nav", "footer", "header", "aside"],
        remove_overlay_elements=True
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://en.wikipedia.org/wiki/Web_scraping",
            config=run_config
        )

        print("\n✅ Content extracted with targeting")
        print(f"📝 Markdown length: {len(result.markdown.raw_markdown):,} chars")
        print("\n--- Preview (first 500 chars) ---")
        print(result.markdown.raw_markdown[:500])

    return result


result = asyncio.run(content_selection_demo())


print("\n" + "="*60)
print("🔐 PART 15: SESSION MANAGEMENT")
print("="*60)


async def session_management_demo():
    """Maintain browser sessions across multiple requests."""
    print("\n🔐 Demonstrating session management...")

    browser_config = BrowserConfig(headless=True)

    async with AsyncWebCrawler(config=browser_config) as crawler:
        session_id = "my_session"

        result1 = await crawler.arun(
            url="https://httpbin.org/cookies/set?session=demo123",
            config=CrawlerRunConfig(
                cache_mode=CacheMode.BYPASS,
                session_id=session_id
            )
        )
        print(f"  Step 1: Set cookies - Success: {result1.success}")

        result2 = await crawler.arun(
            url="https://httpbin.org/cookies",
            config=CrawlerRunConfig(
                cache_mode=CacheMode.BYPASS,
                session_id=session_id
            )
        )
        print(f"  Step 2: Read cookies - Success: {result2.success}")
        print("\n📝 Cookie Response:")
        print(result2.markdown.raw_markdown[:300])

    return result2


result = asyncio.run(session_management_demo())

We analyze the structure and navigability of a site by extracting both internal and external links from a page and summarizing them for inspection. We then demonstrate content targeting with CSS selectors and excluded tags, focusing extraction on the most meaningful sections of a page while avoiding navigation and layout noise. After that, we show session management, where we preserve browser state across requests and verify that cookies persist between sequential crawls.
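The internal/external split that `result.links` gives us can be reproduced for arbitrary URL lists with `urllib.parse` — an illustrative stdlib helper for intuition, not the Crawl4AI implementation:

```python
from urllib.parse import urlparse

def classify_links(page_url: str, hrefs: list) -> dict:
    """Bucket links into internal/external by comparing hostnames."""
    page_host = urlparse(page_url).netloc
    buckets = {"internal": [], "external": []}
    for href in hrefs:
        host = urlparse(href).netloc
        # Relative links (empty netloc) belong to the same site.
        key = "internal" if (not host or host == page_host) else "external"
        buckets[key].append(href)
    return buckets

links = classify_links(
    "https://docs.crawl4ai.com/",
    ["/core/quickstart/",
     "https://docs.crawl4ai.com/api/",
     "https://github.com/unclecode/crawl4ai"],
)
print(len(links["internal"]), len(links["external"]))  # → 2 1
```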

print("\n" + "="*60)
print("🌟 PART 16: COMPLETE REAL-WORLD EXAMPLE")
print("="*60)


async def complete_example():
    """Complete example combining CSS extraction with content filtering."""
    print("\n🌟 Running complete example: Hacker News scraper with filtering")

    schema = {
        "name": "HN Stories",
        "baseSelector": "tr.athing",
        "fields": [
            {"name": "rank", "selector": "span.rank", "type": "text"},
            {"name": "title", "selector": "span.titleline > a", "type": "text"},
            {"name": "url", "selector": "span.titleline > a", "type": "attribute", "attribute": "href"},
            {"name": "site", "selector": "span.sitestr", "type": "text"}
        ]
    }

    browser_config = BrowserConfig(
        headless=True,
        viewport_width=1920,
        viewport_height=1080
    )

    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=JsonCssExtractionStrategy(schema),
        markdown_generator=DefaultMarkdownGenerator(
            content_filter=PruningContentFilter(threshold=0.4)
        )
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://news.ycombinator.com",
            config=run_config
        )

        if result.extracted_content:
            stories = json.loads(result.extracted_content)

            print(f"\n✅ Successfully extracted {len(stories)} stories!")
            print(f"\n{'='*70}")
            print("📰 TOP HACKER NEWS STORIES")
            print("="*70)

            for story in stories[:15]:
                rank = story.get('rank', '?').strip('.') if story.get('rank') else '?'
                title = story.get('title', 'No title')[:50]
                site = story.get('site', 'N/A')
                print(f"  #{rank:<3} {title:<50} ({site})")

            print("="*70)

            return stories

    return []


stories = asyncio.run(complete_example())


print("\n" + "="*60)
print("💾 BONUS: SAVING RESULTS")
print("="*60)


if stories:
    with open('hacker_news_stories.json', 'w') as f:
        json.dump(stories, f, indent=2)
    print(f"✅ Saved {len(stories)} stories to 'hacker_news_stories.json'")
    print("\nTo download in Colab:")
    print("   from google.colab import files")
    print("   files.download('hacker_news_stories.json')")


print("\n" + "="*60)
print("📚 TUTORIAL COMPLETE!")
print("="*60)


print("""
✅ What you learned:


1. Basic crawling with AsyncWebCrawler
2. Browser & crawler configuration
3. Markdown generation (raw vs fit)
4. BM25 query-based content filtering
5. CSS-based structured data extraction
6. Advanced CSS extraction (Hacker News)
7. JavaScript execution for dynamic content
8. LLM-based extraction setup
9. Deep crawling with BFS strategy
10. Multi-URL concurrent crawling
11. Screenshots & media extraction
12. Link extraction & analysis
13. Content targeting with CSS selectors
14. Session management
15. Complete real-world scraping example


📖 RESOURCES:
 • Docs: https://docs.crawl4ai.com/
 • GitHub: https://github.com/unclecode/crawl4ai
 • Discord: https://discord.gg/jP8KfhDhyN


🚀 Happy Crawling with Crawl4AI!
""")

We combine multiple ideas from the tutorial into a complete real-world example that extracts and filters Hacker News stories using structured CSS extraction and markdown pruning. We format the results into readable output, demonstrating how Crawl4AI can support a practical scraping workflow from collection to presentation. Finally, we save the extracted stories to a JSON file and close the tutorial with a clear summary of the key concepts and capabilities we have implemented throughout the notebook.
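As a small follow-on to saving the results, the JSON file can be reloaded and de-duplicated by URL before further processing. The `dedupe_stories` helper below is hypothetical, shown only to illustrate post-processing of the saved output:

```python
def dedupe_stories(stories: list) -> list:
    """Keep the first occurrence of each story URL, preserving order."""
    seen, unique = set(), []
    for story in stories:
        url = story.get("url")
        if url not in seen:
            seen.add(url)
            unique.append(story)
    return unique

sample_stories = [
    {"rank": "1", "title": "A", "url": "https://a.example"},
    {"rank": "2", "title": "A again", "url": "https://a.example"},
    {"rank": "3", "title": "B", "url": "https://b.example"},
]
print(len(dedupe_stories(sample_stories)))  # → 2
```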

In conclusion, we developed a strong end-to-end understanding of how to use Crawl4AI for both simple and advanced crawling tasks. We moved from basic page extraction to more sophisticated workflows involving content filtering, targeted element selection, structured data extraction, dynamic-page interaction, multi-URL concurrency, and deep crawling across linked pages. We also saw how the framework supports richer automation through media capture, persistent sessions, and optional LLM-powered schema extraction. As a result, we finished with a practical foundation for building reliable, efficient, and flexible scraping and crawling pipelines that are ready to support real-world research, monitoring, and intelligent data-processing workflows.




The post A Coding Implementation of Crawl4AI for Web Crawling, Markdown Generation, JavaScript Execution, and LLM-Based Structured Extraction appeared first on MarkTechPost.
