Building a Code Dataset Pipeline from NVIDIA Nemotron-Pretraining-Code-v3 Metadata with Streaming, Pandas, and tiktoken
In this tutorial, we work with NVIDIA’s Nemotron-Pretraining-Code-v3 dataset as a large-scale metadata index for code pretraining analysis. Instead of downloading the complete multi-gigabyte dataset, we stream it, inspect its schema, and build a manageable sample for analysis. We then explore the dataset by examining languages, file extensions, repository frequency, and directory depth, which helps us understand how the index is structured. After that, we reconstruct the raw GitHub URLs from the metadata, attempt to fetch the actual source files, and estimate the token scale of the fetched code. By the end of the workflow, we create a reusable filtered sample and save processed outputs for further experimentation.
Streaming the NVIDIA Nemotron-Pretraining-Code-v3 Dataset and Inspecting Its Schema
!pip -q install -U "datasets>=2.19" huggingface_hub tiktoken pyarrow 2>/dev/null
import os, io, time, itertools, collections, textwrap, math
import pandas as pd
import requests
import matplotlib.pyplot as plt
from datasets import load_dataset, get_dataset_config_names
REPO_ID = "nvidia/Nemotron-Pretraining-Code-v3"
pd.set_option("display.max_colwidth", 80)
# Discover the available configurations and use the first one.
configs = get_dataset_config_names(REPO_ID)
CONFIG = configs[0]
print(f"Configs available : {configs}")
print(f"Using config : {CONFIG}")
# streaming=True iterates over records lazily instead of downloading everything.
stream = load_dataset(REPO_ID, CONFIG, split="train", streaming=True)
print("\nFeatures / schema:")
print(stream.features)
print("\nFirst raw record:")
print(next(iter(stream)))
We set up the Colab environment by installing the required libraries and importing the tools needed for dataset streaming, analysis, and visualization. We define the NVIDIA Nemotron-Pretraining-Code-v3 dataset ID, discover the available dataset configurations, and load the training split in streaming mode. We also inspect the dataset schema and print the first record to understand the structure before conducting deeper analysis.
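To illustrate why streaming keeps the first inspection cheap, here is a minimal sketch of lazy iteration. It does not touch the real dataset: `fake_stream` is a hypothetical stand-in generator, and only the field names `repo` and `rel_path` are taken from the tutorial's metadata columns.

```python
import itertools

def fake_stream(n_total, log):
    """Hypothetical stand-in for a streamed dataset: yields records lazily."""
    for i in range(n_total):
        log.append(i)  # note that record i was actually produced
        yield {"repo": f"owner/repo{i}", "rel_path": f"src/file{i}.py"}

produced = []
stream_like = fake_stream(1_000_000, produced)

# islice pulls only the first three records; the rest are never generated.
head = list(itertools.islice(stream_like, 3))

print(len(head), "records read,", len(produced), "records produced")
# prints: 3 records read, 3 records produced
```

The same pattern (`itertools.islice` over an iterable) is what the sampling step in the next section relies on to stop after a fixed number of rows.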
Building a Shuffled Sample and Analyzing Code Metadata Features
N_SAMPLE = 30_000
# Shuffle through a buffer so the sample is not just the first clustered rows.
shuffled = stream.shuffle(seed=42, buffer_size=20_000)
t0 = time.time()
rows = list(itertools.islice(shuffled, N_SAMPLE))
df = pd.DataFrame(rows)
print(f"\nPulled {len(df):,} rows in {time.time()-t0:,.1f}s")
print(df.head(10))
print("\nColumns:", list(df.columns), "| memory:",
      f"{df.memory_usage(deep=True).sum()/1e6:,.1f} MB")
# Derived features: lowercase extension, number of "/" in the path, bare file name.
df["ext"] = df["rel_path"].str.extract(r"\.([A-Za-z0-9_]+)$")[0].str.lower()
df["depth"] = df["rel_path"].str.count("/")
df["fname"] = df["rel_path"].str.rsplit("/", n=1).str[-1]
print("\n--- Top 15 languages (sample) ---")
lang_counts = df["language"].value_counts()
print(lang_counts.head(15))
print("\n--- Top 15 file extensions (sample) ---")
print(df["ext"].value_counts().head(15))
print("\n--- Most frequent repositories (sample) ---")
print(df["repo"].value_counts().head(10))
print("\n--- Path-depth summary ---")
print(df["depth"].describe())
print(f"\nUnique repos in sample : {df['repo'].nunique():,}")
print(f"Unique languages : {df['language'].nunique():,}")
We create a shuffled sample from the streamed dataset so that we don’t rely solely on the first clustered rows. We convert the sampled records into a Pandas DataFrame and derive useful features such as file extension, path depth, and file name. We then examine the most common languages, file extensions, repositories, and path-depth statistics to better understand the sampled metadata.
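The idea behind shuffling a stream with a fixed-size buffer can be sketched in plain Python. This is a simplified model of the technique, not the `datasets` library's actual implementation; `buffered_shuffle` is a hypothetical helper written for illustration.

```python
import random

def buffered_shuffle(iterable, buffer_size, seed):
    """Approximate shuffle of a stream using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in iterable:
        if len(buffer) < buffer_size:
            buffer.append(item)       # fill phase: nothing is emitted yet
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]             # emit a random buffered item
        buffer[idx] = item            # the incoming item takes its slot
    rng.shuffle(buffer)               # stream exhausted: drain what is left
    yield from buffer

out = list(buffered_shuffle(range(20), buffer_size=5, seed=42))
print(out)
```

Under this model, the k-th emitted item can only come from the first `buffer_size + k` items of the stream, so a small buffer mixes rows only locally. That is the reason for choosing a large `buffer_size` when neighboring rows are clustered, for example by repository.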
Visualizing Languages, File Extensions, Directory Depth, and Repository Frequency
fig, ax = plt.subplots(2, 2, figsize=(14, 9))
lang_counts.head(12).iloc[::-1].plot.barh(ax=ax[0, 0], color="#76b900")
ax[0, 0].set_title("Top 12 languages (sample)"); ax[0, 0].set_xlabel("files")
df["ext"].value_counts().head(12).iloc[::-1].plot.barh(ax=ax[0, 1], color="#5b8def")
ax[0, 1].set_title("Top 12 file extensions (sample)"); ax[0, 1].set_xlabel("files")
# Clip very deep paths so they share the last histogram bin.
df["depth"].clip(upper=12).plot.hist(bins=range(0, 14), ax=ax[1, 0],
                                     color="#f4a261", edgecolor="white")
ax[1, 0].set_title("Directory nesting depth"); ax[1, 0].set_xlabel("'/' count in path")
(df["repo"].value_counts().head(10).iloc[::-1]
    .plot.barh(ax=ax[1, 1], color="#9b5de5"))
ax[1, 1].set_title("Most common repos (sample)"); ax[1, 1].set_xlabel("files")
plt.tight_layout(); plt.show()
We visualize the main patterns found in the sampled metadata using several plots. We compare the top languages, top file extensions, directory nesting depth, and most frequent repositories in the sample. We use these charts to make the dataset easier to interpret and to quickly identify dominant structures inside the metadata index.
Reconstructing Raw GitHub URLs and Fetching Real Source Files
def raw_url(repo: str, commit_id: str, rel_path: str) -> str:
    from urllib.parse import quote
    # Percent-encode the path so spaces and special characters survive in the URL.
    return (f"https://raw.githubusercontent.com/{repo}/{commit_id}/"
            f"{quote(rel_path)}")

df["raw_url"] = df.apply(lambda r: raw_url(r.repo, r.commit_id, r.rel_path), axis=1)
print("\nExample reconstructed URLs:")
for u in df["raw_url"].head(5):
    print(" ", u)

def fetch_code(url: str, max_bytes: int = 200_000, timeout: int = 10):
    # Return the file text, or None on a non-200 status, oversized body, or network error.
    try:
        resp = requests.get(url, timeout=timeout)
        if resp.status_code == 200 and len(resp.content) <= max_bytes:
            return resp.text
        return None
    except requests.RequestException:
        return None

print("\n--- Attempting to fetch a few real files ---")
fetched, attempts = [], 0
for _, r in df.sample(frac=1, random_state=1).iterrows():
    if len(fetched) >= 5:
        break
    attempts += 1
    code = fetch_code(r["raw_url"])
    status = "OK  " if code else "MISS"
    print(f"[{status}] {r['language']:<12} {r['repo']}/{r['rel_path']}")
    if code:
        fetched.append({**r.to_dict(), "code": code, "n_chars": len(code)})
print(f"\nFetched {len(fetched)} files in {attempts} attempts "
      f"(misses are normal: repos get deleted/renamed).")
if fetched:
    ex = fetched[0]
    print(f"\n----- PREVIEW: {ex['repo']}/{ex['rel_path']} ({ex['language']}) -----")
    print(textwrap.shorten(ex["code"].replace("\n", " "), width=600,
                           placeholder=" ...[truncated]"))
We reconstruct raw GitHub URLs from the metadata: the repository name, commit ID, and relative file path. We then attempt to fetch a few real source files from GitHub, gracefully handling missing, deleted, private, or oversized files. We preview one successfully fetched file to see how the metadata index connects back to the actual code content.
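As a small offline check of the URL reconstruction, the sketch below applies the same `quote`-based approach to made-up inputs. The repository name, commit ID, and path are hypothetical values chosen to show how spaces and `#` are percent-encoded while `/` separators are preserved.

```python
from urllib.parse import quote

def raw_url(repo, commit_id, rel_path):
    # quote() keeps "/" by default and percent-encodes unsafe characters.
    return f"https://raw.githubusercontent.com/{repo}/{commit_id}/{quote(rel_path)}"

# Hypothetical metadata values, not taken from the dataset.
url = raw_url("octo-org/demo", "0123abc", "docs/My Notes #1.md")
print(url)
# prints: https://raw.githubusercontent.com/octo-org/demo/0123abc/docs/My%20Notes%20%231.md
```

Without the encoding step, the `#` would be read as the start of a URL fragment and the request would target a different path.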
Filtering Python Files, Estimating Token Scale, and Saving Outputs
TARGET_LANG = "Python"
py_index = df[df["language"] == TARGET_LANG].copy()
print(f"\n{TARGET_LANG} files in sample: {len(py_index):,}")
try:
    import tiktoken
    enc = tiktoken.get_encoding("cl100k_base")
    tok = lambda s: len(enc.encode(s, disallowed_special=()))
except Exception:
    # Fallback heuristic: roughly four characters per token.
    tok = lambda s: max(1, len(s) // 4)
if fetched:
    toks = [tok(f["code"]) for f in fetched]
    print(f"Fetched-file tokens: total={sum(toks):,} "
          f"mean={sum(toks)/len(toks):,.0f}/file")
TOTAL_FILES, TOTAL_TOKENS = 146_323_609, 173e9
print(f"\nFull-dataset scale (per NVIDIA card): "
      f"{TOTAL_FILES:,} files ≈ {TOTAL_TOKENS/1e9:.0f}B tokens "
      f"(~{TOTAL_TOKENS/TOTAL_FILES:,.0f} tokens/file).")
df.to_parquet("nemotron_code_v3_sample.parquet", index=False)
if fetched:
    pd.DataFrame(fetched).to_json("nemotron_fetched_code.jsonl",
                                  orient="records", lines=True)
print("\nSaved: nemotron_code_v3_sample.parquet"
      + (", nemotron_fetched_code.jsonl" if fetched else ""))
print("Done")
We filter the sampled index for Python files and estimate token counts for successfully fetched files. We use tiktoken when available and fall back on a simple character-based estimate when it isn’t. Finally, we save the processed metadata sample and the fetched code outputs so we can reuse them later without having to stream the dataset again.
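The fallback estimate and the per-file average can be reproduced without any dependencies. This is a rough sketch: the four-characters-per-token rule is only a heuristic, `approx_tokens` is a hypothetical name for the tutorial's fallback lambda, and the two totals are the figures quoted in the code above.

```python
def approx_tokens(text: str) -> int:
    """Character-based fallback: about four characters per token, minimum 1."""
    return max(1, len(text) // 4)

snippet = "def add(a, b):\n    return a + b\n"   # 32 characters
print(approx_tokens(snippet))                    # prints: 8

# Dataset-level figures as quoted in the tutorial code.
TOTAL_FILES, TOTAL_TOKENS = 146_323_609, 173e9
tokens_per_file = TOTAL_TOKENS / TOTAL_FILES
print(f"~{tokens_per_file:,.0f} tokens/file")    # prints: ~1,182 tokens/file
```

Comparing the mean of the fetched files against this dataset-wide average gives a quick sanity check, although five files are far too few for a reliable estimate.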
Conclusion
In conclusion, we built a practical end-to-end workflow to understand and use the Nemotron-Pretraining-Code-v3 metadata index. We learned how to stream the dataset efficiently, convert a sample into a DataFrame, perform exploratory analysis, visualize important patterns, and reconstruct GitHub file URLs from repository paths and commit identifiers. We also demonstrated how metadata can be traced back to the source code and how token estimation gives a sense of dataset scale.
The post Building a Code Dataset Pipeline from NVIDIA Nemotron-Pretraining-Code-v3 Metadata with Streaming, Pandas, and tiktoken appeared first on MarkTechPost.
