Building a Code Dataset Pipeline from NVIDIA Nemotron-Pretraining-Code-v3 Metadata with Streaming, Pandas, and tiktoken

In this tutorial, we work with NVIDIA’s Nemotron-Pretraining-Code-v3 dataset as a large-scale metadata index for code pretraining analysis. Instead of downloading the complete multi-gigabyte dataset, we stream it, inspect its schema, and build a manageable sample for analysis. We then explore the dataset by studying languages, file extensions, repository frequency, and directory depth, which helps us understand how the index is structured. After that, we reconstruct the raw GitHub URLs from the metadata, attempt to fetch the actual source files, and estimate the token scale of the fetched code. By the end of the workflow, we create a reusable filtered sample and save processed outputs for further experimentation.

Streaming the NVIDIA Nemotron-Pretraining-Code-v3 Dataset and Inspecting Its Schema

# Install dependencies (notebook shell command).
!pip -q install -U "datasets>=2.19" huggingface_hub tiktoken pyarrow 2>/dev/null
import os, io, time, itertools, collections, textwrap, math
import pandas as pd
import requests
import matplotlib.pyplot as plt
from datasets import load_dataset, get_dataset_config_names
REPO_ID = "nvidia/Nemotron-Pretraining-Code-v3"
pd.set_option("display.max_colwidth", 80)
# Discover the available configs and use the first one.
configs = get_dataset_config_names(REPO_ID)
CONFIG = configs[0]
print(f"Configs available : {configs}")
print(f"Using config      : {CONFIG}")
# streaming=True iterates over records without downloading the whole dataset.
stream = load_dataset(REPO_ID, CONFIG, split="train", streaming=True)
print("\nFeatures / schema:")
print(stream.features)
print("\nFirst raw record:")
print(next(iter(stream)))

We set up the Colab environment by installing the required libraries and importing the tools needed for dataset streaming, analysis, and visualization. We define the NVIDIA Nemotron-Pretraining-Code-v3 dataset ID, discover the available dataset configurations, and load the training split in streaming mode. We also inspect the dataset schema and print the first record to understand the structure before conducting deeper analysis.
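Because a streamed dataset behaves like an iterable of dictionaries, the same peek-before-you-pull idea can be tried without any network access. The sketch below is only an illustration: `peek_schema` and `fake_stream` are hypothetical names (not part of the `datasets` API), and the record values are made up. It reads just the first few records and reports which Python types appear under each field.

```python
import itertools
from collections import defaultdict

def peek_schema(records, k=5):
    """Map each field name to the sorted type names seen in the first k records."""
    seen = defaultdict(set)
    for rec in itertools.islice(records, k):  # consumes at most k records
        for key, value in rec.items():
            seen[key].add(type(value).__name__)
    return {key: sorted(types) for key, types in seen.items()}

def fake_stream():
    # Hypothetical stand-in for a streamed dataset: an endless generator.
    i = 0
    while True:
        yield {"repo": f"user/repo{i}", "rel_path": "src/main.py", "language": "Python"}
        i += 1

schema = peek_schema(fake_stream(), k=3)
print(schema)
```

Since `itertools.islice` stops after `k` items, the helper is safe to call even on an unbounded stream.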

Building a Shuffled Sample and Analyzing Code Metadata Features

N_SAMPLE = 30_000
# Shuffle through a buffer so the sample is not just the first clustered rows.
shuffled = stream.shuffle(seed=42, buffer_size=20_000)
t0 = time.time()
rows = list(itertools.islice(shuffled, N_SAMPLE))
df = pd.DataFrame(rows)
print(f"\nPulled {len(df):,} rows in {time.time()-t0:,.1f}s")
print(df.head(10))
print("\nColumns:", list(df.columns), "| memory:",
      f"{df.memory_usage(deep=True).sum()/1e6:,.1f} MB")
# Derived features: lowercase extension, directory depth, and bare file name.
df["ext"]   = df["rel_path"].str.extract(r"\.([A-Za-z0-9_]+)$")[0].str.lower()
df["depth"] = df["rel_path"].str.count("/")
df["fname"] = df["rel_path"].str.rsplit("/", n=1).str[-1]
print("\n--- Top 15 languages (sample) ---")
lang_counts = df["language"].value_counts()
print(lang_counts.head(15))
print("\n--- Top 15 file extensions (sample) ---")
print(df["ext"].value_counts().head(15))
print("\n--- Most frequent repositories (sample) ---")
print(df["repo"].value_counts().head(10))
print("\n--- Path-depth summary ---")
print(df["depth"].describe())
print(f"\nUnique repos in sample : {df['repo'].nunique():,}")
print(f"Unique languages       : {df['language'].nunique():,}")

We create a shuffled sample from the streamed dataset so that we don’t rely only on the first clustered rows. We convert the sampled records into a Pandas DataFrame and derive useful features such as file extension, path depth, and file name. We then examine the most common languages, file extensions, repositories, and path-depth statistics to better understand the sampled metadata.
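To build some intuition for what a shuffle with a `buffer_size` does on a stream, here is a simplified, pure-Python sketch of buffer-based shuffling. It is an assumption-laden illustration and not the `datasets` library's actual implementation; `buffered_shuffle` is a hypothetical name. The idea is to fill a fixed-size buffer, then repeatedly emit a random buffered item and replace it with the next incoming one.

```python
import random

def buffered_shuffle(iterable, buffer_size, seed=None):
    """Yield items in approximately shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in iterable:
        if len(buffer) < buffer_size:
            buffer.append(item)        # fill phase: nothing is emitted yet
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]              # emit a random buffered item
        buffer[idx] = item             # replace it with the incoming item
    rng.shuffle(buffer)                # drain whatever is left at the end
    yield from buffer

out = list(buffered_shuffle(range(10), buffer_size=4, seed=42))
print(out)
```

In this sketch an item can only be emitted after it has entered the buffer, so the first emitted item always comes from the first `buffer_size` inputs. The shuffle is therefore local rather than global, and a larger buffer mixes the stream more thoroughly at the cost of memory.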

Visualizing Languages, File Extensions, Directory Depth, and Repository Frequency

fig, ax = plt.subplots(2, 2, figsize=(14, 9))
# iloc[::-1] reverses the order so the largest bar is drawn at the top.
lang_counts.head(12).iloc[::-1].plot.barh(ax=ax[0, 0], color="#76b900")
ax[0, 0].set_title("Top 12 languages (sample)"); ax[0, 0].set_xlabel("files")
df["ext"].value_counts().head(12).iloc[::-1].plot.barh(ax=ax[0, 1], color="#5b8def")
ax[0, 1].set_title("Top 12 file extensions (sample)"); ax[0, 1].set_xlabel("files")
# Depths above 12 are clipped into the last bin.
df["depth"].clip(upper=12).plot.hist(bins=range(0, 14), ax=ax[1, 0],
                                     color="#f4a261", edgecolor="white")
ax[1, 0].set_title("Directory nesting depth"); ax[1, 0].set_xlabel("'/' count in path")
(df["repo"].value_counts().head(10).iloc[::-1]
   .plot.barh(ax=ax[1, 1], color="#9b5de5"))
ax[1, 1].set_title("Most common repos (sample)"); ax[1, 1].set_xlabel("files")
plt.tight_layout(); plt.show()

We visualize the main patterns found in the sampled metadata using several plots. We compare the top languages, top file extensions, directory nesting depth, and most frequent repositories in the sample. We use these charts to make the dataset easier to interpret and to quickly identify dominant structures inside the metadata index.
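One detail of the nesting-depth histogram is easy to miss: depths are clipped at 12 before plotting, so the last bar stands for "12 or deeper" rather than exactly 12. The stdlib-only sketch below reproduces that counting on a handful of made-up paths; `clipped_depth_counts` is a hypothetical helper name used only for illustration.

```python
from collections import Counter

def clipped_depth_counts(paths, upper=12):
    """Count '/' separators per path, folding every depth above `upper` into `upper`."""
    return Counter(min(p.count("/"), upper) for p in paths)

# Hypothetical paths; the last one is nested 20 directories deep.
paths = ["README.md", "src/main.py", "a/b/c/d.py", "x/" * 20 + "deep.txt"]
counts = clipped_depth_counts(paths, upper=12)
for depth in sorted(counts):
    print(depth, counts[depth])
```

Clipping keeps a few extremely deep paths from stretching the x-axis, at the cost of hiding how deep the tail actually goes.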

Reconstructing Raw GitHub URLs and Fetching Real Source Files

def raw_url(repo: str, commit_id: str, rel_path: str) -> str:
    from urllib.parse import quote
    # Percent-encode the path; quote() leaves "/" separators intact.
    return (f"https://raw.githubusercontent.com/{repo}/{commit_id}/"
            f"{quote(rel_path)}")
df["raw_url"] = df.apply(lambda r: raw_url(r.repo, r.commit_id, r.rel_path), axis=1)
print("\nExample reconstructed URLs:")
for u in df["raw_url"].head(5):
    print(" ", u)
def fetch_code(url: str, max_bytes: int = 200_000, timeout: int = 10):
    # Return the file text, or None on a non-200 status, oversized body, or network error.
    try:
        resp = requests.get(url, timeout=timeout)
        if resp.status_code == 200 and len(resp.content) <= max_bytes:
            return resp.text
        return None
    except requests.RequestException:
        return None
print("\n--- Attempting to fetch a few real files ---")
fetched, attempts = [], 0
# Walk the sample in random order until five files are fetched.
for _, r in df.sample(frac=1, random_state=1).iterrows():
    if len(fetched) >= 5:
        break
    attempts += 1
    code = fetch_code(r["raw_url"])
    status = "OK " if code else "MISS"
    print(f"[{status}] {r['language']:<12} {r['repo']}/{r['rel_path']}")
    if code:
        fetched.append({**r.to_dict(), "code": code, "n_chars": len(code)})
print(f"\nFetched {len(fetched)} files in {attempts} attempts "
      f"(misses are normal: repos get deleted/renamed).")
if fetched:
    ex = fetched[0]
    print(f"\n----- PREVIEW: {ex['repo']}/{ex['rel_path']} ({ex['language']}) -----")
    print(textwrap.shorten(ex["code"].replace("\n", "  "), width=600,
                           placeholder=" ...[truncated]"))

We reconstruct raw GitHub URLs from the metadata: the repository name, commit ID, and relative file path. We then attempt to fetch a few real source files from GitHub, gracefully handling missing, deleted, private, or oversized files. We preview one successfully fetched file to see how the metadata index connects back to the actual code content.
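The percent-encoding step matters for paths that contain spaces or characters such as `#`, which would otherwise be read as a URL fragment. The short sketch below repeats the URL builder on a made-up record; the repository name, commit ID, and path are hypothetical and chosen only to show what `urllib.parse.quote` does with its default settings.

```python
from urllib.parse import quote

def raw_url(repo, commit_id, rel_path):
    # quote() leaves "/" unescaped by default, so directory separators survive.
    return f"https://raw.githubusercontent.com/{repo}/{commit_id}/{quote(rel_path)}"

# Hypothetical values for illustration only.
url = raw_url("example-user/example-repo", "0123abc", "docs/my notes#1.md")
print(url)
# → https://raw.githubusercontent.com/example-user/example-repo/0123abc/docs/my%20notes%231.md
```

Pinning the URL to a commit ID rather than a branch name means the link keeps pointing at the exact file version recorded in the index, as long as the repository still exists.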

Filtering Python Files, Estimating Token Scale, and Saving Outputs

TARGET_LANG = "Python"
py_index = df[df["language"] == TARGET_LANG].copy()
print(f"\n{TARGET_LANG} files in sample: {len(py_index):,}")
# Prefer tiktoken; otherwise fall back to a rough 4-characters-per-token estimate.
try:
    import tiktoken
    enc = tiktoken.get_encoding("cl100k_base")
    tok = lambda s: len(enc.encode(s, disallowed_special=()))
except Exception:
    tok = lambda s: max(1, len(s) // 4)
if fetched:
    toks = [tok(f["code"]) for f in fetched]
    print(f"Fetched-file tokens: total={sum(toks):,}  "
          f"mean={sum(toks)/len(toks):,.0f}/file")
TOTAL_FILES, TOTAL_TOKENS = 146_323_609, 173e9
print(f"\nFull-dataset scale (per NVIDIA card): "
      f"{TOTAL_FILES:,} files ≈ {TOTAL_TOKENS/1e9:.0f}B tokens "
      f"(~{TOTAL_TOKENS/TOTAL_FILES:,.0f} tokens/file).")
# Persist the metadata sample and any fetched code for later reuse.
df.to_parquet("nemotron_code_v3_sample.parquet", index=False)
if fetched:
    pd.DataFrame(fetched).to_json("nemotron_fetched_code.jsonl",
                                  orient="records", lines=True)
print("\nSaved: nemotron_code_v3_sample.parquet"
      + (", nemotron_fetched_code.jsonl" if fetched else ""))
print("Done")

We filter the sampled index for Python files and estimate token counts for successfully fetched files. We use tiktoken when available and fall back on a simple character-based estimate when it isn’t. We also save the processed metadata sample and the fetched code outputs so we can reuse them later without having to stream the dataset again.
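The fallback estimate and the full-dataset arithmetic can be checked in isolation. The sketch below reuses the two totals quoted in the tutorial's code and the same 4-characters-per-token heuristic; `estimate_tokens` is a hypothetical name for that fallback. The heuristic is only a rough rule of thumb, and real tokenizer counts for source code may differ noticeably.

```python
def estimate_tokens(text: str) -> int:
    """Fallback heuristic: roughly 4 characters per token, never less than 1."""
    return max(1, len(text) // 4)

# Totals as quoted in the tutorial's code.
TOTAL_FILES = 146_323_609
TOTAL_TOKENS = 173e9

avg_tokens_per_file = TOTAL_TOKENS / TOTAL_FILES
print(f"{avg_tokens_per_file:,.0f} tokens/file")

snippet = "def add(a, b):\n    return a + b\n"
print(estimate_tokens(snippet))
```

Dividing the two totals gives an average of roughly 1,182 tokens per file, a useful baseline when comparing against the mean token count of the handful of files actually fetched.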

Conclusion

In conclusion, we built a practical end-to-end workflow to understand and use the Nemotron-Pretraining-Code-v3 metadata index. We learned how to stream the dataset efficiently, convert a sample into a DataFrame, perform exploratory analysis, visualize important patterns, and reconstruct GitHub file URLs from repository paths and commit identifiers. We also demonstrated how metadata can be traced back to the source code and how token estimation gives a sense of dataset scale.

The post Building a Code Dataset Pipeline from NVIDIA Nemotron-Pretraining-Code-v3 Metadata with Streaming, Pandas, and tiktoken appeared first on MarkTechPost.
