A Coding Hands-On on FineWeb for Streaming, Filtering, Deduplication, Tokenization, and Large-Scale Web Corpus Analytics

In this tutorial, we explore the FineWeb dataset through a hands-on workflow. We stream a manageable sample of the dataset without downloading the full multi-terabyte corpus, inspect its schema and metadata, and analyze key fields such as URL, language, language score, and token count. We also reproduce simplified versions of FineWeb’s quality-filtering pipeline, apply MinHash-based near-duplicate detection, verify token counts with the GPT-2 tokenizer, and generate analytics on domains, language scores, document lengths, and tokenizer efficiency.

import subprocess, sys
def pip(*pkgs):
    # check=True raises CalledProcessError if the pip install fails
    subprocess.run([sys.executable, "-m", "pip", "install", "-q", *pkgs], check=True)
pip("datasets>=2.19", "datasketch", "tiktoken", "pandas", "matplotlib", "tqdm")
import re, math, random, collections
from urllib.parse import urlparse
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tqdm.auto import tqdm
from datasets import load_dataset
# Fix the seeds and widen text columns so output is reproducible and readable
random.seed(0); np.random.seed(0)
pd.set_option("display.max_colwidth", 90)

We begin by installing all required libraries for streaming, analysis, deduplication, tokenization, and visualization. We import the core Python packages needed to process FineWeb documents and work with tabular data. We also set random seeds and display options so that our results remain consistent and easier to inspect.

N_DOCS = 3000
print(f"Streaming {N_DOCS} docs from FineWeb sample-10BT ...")
# streaming=True iterates over the remote data lazily instead of downloading it all
stream = load_dataset(
    "HuggingFaceFW/fineweb",
    name="sample-10BT",
    split="train",
    streaming=True,
)
docs = []
for i, doc in enumerate(tqdm(stream, total=N_DOCS)):
    docs.append(doc)
    if i + 1 >= N_DOCS:
        break
df = pd.DataFrame(docs)
print("\nColumns:", list(df.columns))
print(df[["url", "language", "language_score", "token_count"]].head(5))
ex = docs[0]
print("\n--- Example record (fields) ---")
for k, v in ex.items():
    # Truncate long strings so the document text does not flood the output
    preview = (v[:120] + "…") if isinstance(v, str) and len(v) > 120 else v
    print(f"{k:>16}: {preview}")

We stream a fixed number of documents from the FineWeb sample-10BT subset without downloading the full dataset. We convert the streamed records into a DataFrame and inspect key metadata fields, including URL, language, language score, and token count. We also print a complete example record to better understand the dataset’s structure.
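
The early-exit loop is one way to cap a stream. As a minimal sketch of the same idea, `itertools.islice` takes the first N records from any iterable without consuming the rest. The generator `fake_stream` and the helper `take` are hypothetical stand-ins introduced here so the example runs offline; they are not part of FineWeb or the `datasets` library.

```python
from itertools import islice

def fake_stream():
    """Hypothetical stand-in for a streaming dataset: yields records lazily, forever."""
    i = 0
    while True:
        yield {"id": i, "text": f"document {i}"}
        i += 1

def take(stream, n):
    """Materialize only the first n records of a (possibly unbounded) iterable."""
    return list(islice(stream, n))

stream = fake_stream()
first = take(stream, 5)
print(len(first), first[-1]["id"])

# The stream was advanced by exactly 5 records, so the next record has id 5.
following = next(stream)
print(following["id"])
```

Because `islice` stops pulling as soon as it has N items, nothing beyond the requested records is fetched, which is the property that makes sampling a multi-terabyte corpus practical.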

WORD = re.compile(r"\b\w+\b")
def gopher_quality(text):
    # Simplified Gopher-style rules: length, word statistics, symbols, bullets, stopwords
    words = WORD.findall(text)
    n = len(words)
    if n < 50 or n > 100_000:
        return False, "word_count_out_of_range"
    mean_len = sum(len(w) for w in words) / n
    if mean_len < 3 or mean_len > 10:
        return False, "bad_mean_word_length"
    if (text.count("#") + text.count("...")) / n > 0.1:
        return False, "too_many_symbols"
    lines = text.split("\n")
    if lines and sum(l.lstrip().startswith(("•", "-", "*")) for l in lines) / len(lines) > 0.9:
        return False, "mostly_bullets"
    stops = {"the", "be", "to", "of", "and", "that", "have", "with"}
    if len(stops & {w.lower() for w in words}) < 2:
        return False, "too_few_stopwords"
    return True, "ok"
def c4_quality(text):
    # Simplified C4-style rules: reject empty pages, boilerplate phrases, code-like pages
    lines = [l for l in text.split("\n") if l.strip()]
    if not lines:
        return False, "empty"
    low = text.lower()
    for bad in ("lorem ipsum", "javascript is disabled"):
        if bad in low:
            return False, f"boilerplate:{bad}"
    if text.count("{") > 0 and text.count("{") / max(len(lines), 1) > 0.5:
        return False, "too_many_braces"
    return True, "ok"
def fineweb_custom(text):
    # Simplified FineWeb-style rules: repeated lines and list-like pages
    lines = [l.strip() for l in text.split("\n") if l.strip()]
    if not lines:
        return False, "empty"
    dup_frac = 1 - len(set(lines)) / len(lines)
    if dup_frac > 0.3:
        return False, "duplicated_lines"
    short_frac = sum(len(l) < 30 for l in lines) / len(lines)
    if short_frac > 0.67 and len(lines) > 5:
        return False, "list_like"
    return True, "ok"
outcomes = []
for d in docs:
    t = d["text"]
    g_ok, g_r = gopher_quality(t)
    c_ok, c_r = c4_quality(t)
    f_ok, f_r = fineweb_custom(t)
    # Label passing documents "saved"; otherwise record the first rule that failed
    reason = "saved" if (g_ok and c_ok and f_ok) else (g_r if not g_ok else c_r if not c_ok else f_r)
    outcomes.append(reason)
filter_summary = pd.Series(outcomes).value_counts()
print("\n--- Quality-filter results on already-clean FineWeb data ---")
print("(Most pass: FineWeb is pre-filtered. Rejections show what the rules catch.)")
print(filter_summary)

We recreate simplified versions of FineWeb’s quality filters using Gopher-style, C4-style, and custom text-cleaning heuristics. We check each document for issues such as abnormal word counts, poor word statistics, boilerplate text, repeated lines, and list-like structure. We summarize how many documents pass or fail these filters to understand the quality of the already-cleaned FineWeb sample.
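
To make the repeated-lines rule concrete, here is a small worked example of the duplicated-line fraction on a made-up page. The helper name `duplicated_line_fraction` and the sample text are illustrative assumptions; only the formula and the 0.3 cutoff come from the tutorial’s custom filter.

```python
def duplicated_line_fraction(text):
    """Share of non-empty lines that repeat another line (0.0 means all unique)."""
    lines = [l.strip() for l in text.split("\n") if l.strip()]
    if not lines:
        return 0.0
    return 1 - len(set(lines)) / len(lines)

# Made-up page: a navigation label repeated between real sentences.
page = "\n".join([
    "Home",
    "Welcome to the site.",
    "Home",
    "Read our latest post.",
    "Home",
])

frac = duplicated_line_fraction(page)
rejected = frac > 0.3  # same cutoff as the tutorial's custom filter

# 5 non-empty lines, 3 distinct ones: 1 - 3/5
print(f"duplicated-line fraction: {frac:.2f}, rejected: {rejected}")
```

With five lines and three distinct values, the fraction is 0.4, so this page would be rejected as "duplicated_lines" under the 0.3 cutoff.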

from datasketch import MinHash, MinHashLSH
def shingles(text, k=5):
    # Overlapping k-word windows; short texts yield a single shingle
    toks = WORD.findall(text.lower())
    return {" ".join(toks[i:i+k]) for i in range(max(len(toks) - k + 1, 1))}
NUM_PERM = 128
THRESHOLD = 0.7
lsh = MinHashLSH(threshold=THRESHOLD, num_perm=NUM_PERM)
minhashes = {}
for idx, d in enumerate(tqdm(docs, desc="MinHashing")):
    m = MinHash(num_perm=NUM_PERM)
    for s in shingles(d["text"]):
        m.update(s.encode("utf8"))
    minhashes[idx] = m
    lsh.insert(str(idx), m)
dup_pairs = set()
for idx, m in minhashes.items():
    # LSH returns candidate keys whose estimated similarity is likely above the threshold
    for cand in lsh.query(m):
        c = int(cand)
        if c != idx:
            dup_pairs.add(tuple(sorted((idx, c))))
print(f"\nFound {len(dup_pairs)} near-duplicate pairs (approx. Jaccard ≥ {THRESHOLD}).")
if dup_pairs:
    a, b = next(iter(dup_pairs))
    j = minhashes[a].jaccard(minhashes[b])
    print(f"Example pair (estimated Jaccard ≈ {j:.2f}):")
    print("  DOC A:", docs[a]["text"][:160].replace("\n", " "), "…")
    print("  DOC B:", docs[b]["text"][:160].replace("\n", " "), "…")
else:
    print("No near-dupes in this slice, as expected: FineWeb is deduplicated per crawl.")

We implement MinHash-based near-duplicate detection to approximate how large web corpora identify repeated or highly similar documents. We convert each document into word shingles, generate MinHash signatures, and index them with Locality Sensitive Hashing. We then search for near-duplicate document pairs and inspect an example if any similar texts are found.
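
To show what a MinHash signature estimates, the sketch below computes the exact Jaccard similarity of two made-up sentences over 3-word shingles and compares it with a MinHash estimate. It uses only the standard library. The seeded SHA-1 hashing is an assumption chosen for readability and is not how `datasketch` generates its permutations, and every helper name here is hypothetical.

```python
import hashlib
import re

WORD_RE = re.compile(r"\b\w+\b")

def word_shingles(text, k=3):
    """Set of overlapping k-word windows from the lowercased text."""
    toks = WORD_RE.findall(text.lower())
    return {" ".join(toks[i:i + k]) for i in range(max(len(toks) - k + 1, 1))}

def exact_jaccard(a, b):
    """|A ∩ B| / |A ∪ B| for two non-empty sets."""
    return len(a & b) / len(a | b)

def minhash_signature(shingle_set, num_perm=128):
    """For each seed, keep the minimum seeded hash value over all shingles."""
    sig = []
    for seed in range(num_perm):
        prefix = f"{seed}:".encode("utf8")
        sig.append(min(
            int.from_bytes(hashlib.sha1(prefix + s.encode("utf8")).digest()[:8], "big")
            for s in shingle_set
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions where the two minima agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

# Two made-up sentences that differ only in their last two words.
doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the old bridge"

sh_a, sh_b = word_shingles(doc_a), word_shingles(doc_b)
exact = exact_jaccard(sh_a, sh_b)  # 9 shared shingles out of 13 distinct ones
estimate = estimated_jaccard(minhash_signature(sh_a), minhash_signature(sh_b))
print(f"exact Jaccard = {exact:.3f}, MinHash estimate = {estimate:.3f}")
```

Each sentence produces 11 shingles, 9 of which are shared, so the exact similarity is 9/13. The estimate should land near that value, and its error shrinks as `num_perm` grows, which is why the signature length trades accuracy against memory.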

import tiktoken
enc = tiktoken.get_encoding("gpt2")
sample = docs[:200]
# encode_ordinary treats special-token strings such as "<|endoftext|>" as plain text,
# so web pages that happen to contain them do not raise an error
recomputed = [len(enc.encode_ordinary(d["text"])) for d in tqdm(sample, desc="Tokenizing")]
stored = [d["token_count"] for d in sample]
diffs = np.array(recomputed) - np.array(stored)
print(f"\n--- Verifying token_count field (gpt2) on {len(sample)} docs ---")
print(f"Mean abs diff vs stored token_count: {np.abs(diffs).mean():.2f} tokens")
print(f"Exact matches: {(diffs == 0).mean()*100:.0f}%   (small drift = tokenizer version)")
# clip(lower=1) guards against division by zero for empty documents
df["chars_per_token"] = df["text"].str.len() / df["token_count"].clip(lower=1)
print(f"Avg characters per token: {df['chars_per_token'].mean():.2f}")

We verify the dataset’s token_count field by recomputing GPT-2 token counts with the tiktoken tokenizer. We compare the recomputed token counts with the stored values and measure the average difference between them. We also calculate characters per token to understand tokenizer efficiency across the sampled documents.
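
The two comparison statistics are simple enough to check by hand. The sketch below recomputes them on made-up counts without any tokenizer; the numbers and the helper names `compare_counts` and `chars_per_token` are illustrative assumptions, not values from FineWeb.

```python
def compare_counts(stored, recomputed):
    """Return (mean absolute difference, share of exact matches) for paired counts."""
    diffs = [r - s for s, r in zip(stored, recomputed)]
    mean_abs = sum(abs(d) for d in diffs) / len(diffs)
    exact = sum(d == 0 for d in diffs) / len(diffs)
    return mean_abs, exact

def chars_per_token(text, n_tokens):
    """Characters per token, flooring the token count at 1 to avoid dividing by zero."""
    return len(text) / max(n_tokens, 1)

# Made-up counts for four documents: two match, one is off by 1, one by 2.
stored = [10, 20, 30, 40]
recomputed = [10, 21, 30, 38]

mean_abs_diff, exact_rate = compare_counts(stored, recomputed)
print(f"mean abs diff: {mean_abs_diff:.2f}, exact matches: {exact_rate:.0%}")

# A 12-character string split into a hypothetical 3 tokens gives 4 chars per token.
ratio = chars_per_token("hello world!", 3)
print(f"chars per token: {ratio:.1f}")
```

The differences here are 0, 1, 0, and -2, giving a mean absolute difference of 0.75 tokens and a 50% exact-match rate. Reporting both is useful because a low mean can hide a small number of documents with large mismatches.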

# Strip only a leading "www." so hosts that contain "www." elsewhere stay intact
df["domain"] = df["url"].apply(lambda u: re.sub(r"^www\.", "", urlparse(u).netloc) if isinstance(u, str) else "?")
top_domains = df["domain"].value_counts().head(15)
print("\n--- Top 15 domains in sample ---")
print(top_domains)
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Clip long tails so a few very large documents do not flatten the histograms
axes[0, 0].hist(df["token_count"].clip(upper=4000), bins=50, color="#7b2d26")
axes[0, 0].set_title("Token count per doc (gpt2)")
axes[0, 0].set_xlabel("tokens"); axes[0, 0].set_ylabel("docs")
axes[0, 1].hist(df["language_score"], bins=40, color="#2d5d7b")
axes[0, 1].axvline(0.65, color="red", ls="--", label="FineWeb cutoff 0.65")
axes[0, 1].set_title("fastText English language score")
axes[0, 1].set_xlabel("score"); axes[0, 1].legend()
axes[1, 0].hist(df["chars_per_token"].clip(upper=8), bins=40, color="#3f7b2d")
axes[1, 0].set_title("Characters per token (compression)")
axes[1, 0].set_xlabel("chars / token")
# Reverse the order so the most frequent domain appears at the top of the bar chart
top_domains.iloc[::-1].plot(kind="barh", ax=axes[1, 1], color="#7b5d2d")
axes[1, 1].set_title("Top domains")
plt.tight_layout()
plt.show()
print("\n" + "=" * 70)
print("SUMMARY")
print("=" * 70)
print(f"Docs streamed          : {len(df):,}")
print(f"Total gpt2 tokens       : {df['token_count'].sum():,}")
print(f"Median tokens/doc       : {int(df['token_count'].median())}")
print(f"Unique domains          : {df['domain'].nunique():,}")
print(f"Mean language_score     : {df['language_score'].mean():.3f}")
print(f"Near-duplicate pairs    : {len(dup_pairs)}")
print(f"Docs flagged by filters : {(pd.Series(outcomes) != 'saved').sum()} / {len(outcomes)}")
print("\nNext steps:")
print("  • Swap name='sample-10BT' for a full crawl, e.g. name='CC-MAIN-2024-10'")
print("  • Raise N_DOCS for stronger statistics")
print("  • Use the full datatrove pipeline to reproduce FineWeb end-to-end")

We extract domains from URLs and identify the most frequent domains present in the FineWeb sample. We create visualizations for token count distribution, language score distribution, characters per token, and top domains. We finish by printing a compact summary of streamed documents, total tokens, median length, unique domains, language quality, duplicate count, and filter results.
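
Domain extraction has a few edge cases worth seeing in isolation. The sketch below is a variation on the tutorial’s approach, not a copy of it: it assumes `urlparse(...).hostname` is acceptable in place of `netloc`, since `hostname` lowercases the host and drops any port. The URLs and the helper name `domain_of` are made up for illustration.

```python
from collections import Counter
from urllib.parse import urlparse

def domain_of(url):
    """Normalized host for a URL, or "?" when the value is missing or has no host."""
    if not isinstance(url, str):
        return "?"
    host = urlparse(url).hostname  # lowercased, port removed; None if there is no host
    if not host:
        return "?"
    # Remove only a leading "www." so the rest of the host is left untouched
    return host[4:] if host.startswith("www.") else host

# Made-up URLs covering a www prefix, mixed case with a port, a subdomain, and a missing value.
urls = [
    "https://www.example.com/a",
    "http://example.com/b",
    "https://Example.com:8080/c",
    "https://blog.example.org/post",
    None,
]

counts = Counter(domain_of(u) for u in urls)
for domain, n in counts.most_common():
    print(f"{domain:>18}: {n}")
```

The first three URLs collapse to the same domain, which is the behavior you usually want when ranking sources. Whether subdomains such as `blog.example.org` should also be merged into their parent domain is a separate design decision that this sketch leaves alone.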

In conclusion, we developed a practical understanding of how large-scale web datasets such as FineWeb are explored, filtered, deduplicated, and analyzed for language model training. We worked efficiently with streaming data, tested quality heuristics on real documents, identified near-duplicate text patterns, and validated token-level metadata using a production-style tokenizer. This workflow can be scaled to larger FineWeb crawls, extended to deeper corpus analysis, and used to design high-quality preprocessing pipelines for LLM dataset preparation.

The post A Coding Hands-On on FineWeb for Streaming, Filtering, Deduplication, Tokenization, and Large-Scale Web Corpus Analytics appeared first on MarkTechPost.