How to Use AgentTrove: Streaming 1.7M Agentic Traces and Building a Clean ShareGPT SFT Dataset in Python
In this tutorial, we discover AgentTrove, one of many largest open-source collections of agentic interplay traces, and find out how to work with it effectively. Instead of downloading the complete dataset, we use streaming to examine rows, detect the dialog schema, normalize agent turns, and perceive how person, assistant, system, and device messages are structured. We additionally construct utilities to parse command-style assistant outputs, render full trajectories in a readable format, and research how brokers work together with instruments throughout totally different duties. Also, we create a light-weight analytical workflow that samples 1000’s of traces, converts them into a DataFrame, summarizes turn-level statistics, visualizes essential dataset patterns, and exports profitable traces into a clear ShareGPT-style JSONL format for supervised fine-tuning.
!pip -q set up "datasets>=2.19" pandas matplotlib pyarrow huggingface_hub
import itertools, json, collections, textwrap, re, random, statistics
import pandas as pd
import matplotlib.pyplot as plt
from datasets import load_dataset
REPO = "open-thoughts/AgentTrove"
random.seed(0)
print("
Imports prepared. Target dataset:", REPO)
ds = load_dataset(REPO, break up="prepare", streaming=True)
print("
Streaming dataset opened.")
first = subsequent(iter(ds))
print("n
Columns current in a row:")
for ok in first.keys():
v = first[k]
t = kind(v).__name__
preview = (str(v)[:70] + "…") if v shouldn't be None and len(str(v)) > 70 else v
print(f" • {ok:<18} ({t}): {preview}")
We set up the required libraries and import the core instruments wanted for streaming, evaluation, and visualization. We outline the AgentTrove repository, open the dataset in streaming mode, and keep away from downloading the complete dataset domestically. We then examine the primary row to perceive the obtainable columns and get an preliminary view of the dataset schema.
def find_trace_key(row):
for cand in ("conversations", "messages"):
if cand in row and isinstance(row[cand], record):
return cand
for ok, v in row.gadgets():
if isinstance(v, record) and v and isinstance(v[0], dict) and
("content material" in v[0] or "function" in v[0] or "worth" in v[0]):
return ok
increase KeyError("No conversation-like column discovered.")
TRACE_KEY = find_trace_key(first)
print(f"n
Trace column detected: '{TRACE_KEY}'")
def normalize_turns(hint):
turns = []
for flip in hint:
if not isinstance(flip, dict):
turns.append(("unknown", str(flip)))
proceed
function = flip.get("function") or flip.get("from") or "unknown"
content material = flip.get("content material")
if content material is None:
content material = flip.get("worth", "")
turns.append((str(function), "" if content material is None else str(content material)))
return turns
sample_turns = normalize_turns(first[TRACE_KEY])
print(f"
First hint has {len(sample_turns)} turns. "
f"Roles: {collections.Counter(r for r, _ in sample_turns)}")
We create a defensive perform to mechanically detect the column that accommodates the dialog or hint information. We then normalize every flip into a constant role-content format in order that totally different dataset schemas could be dealt with easily. We additionally examine the primary trajectory to rely the variety of turns and perceive the roles current in the dialog.
def extract_commands(assistant_content):
"""Best-effort: pull shell instructions out of an assistant JSON flip."""
cmds = []
txt = re.sub(r"```(?:json)?|```", "", assistant_content).strip()
attempt:
obj = json.hundreds(txt)
besides Exception:
return cmds
def stroll(o):
if isinstance(o, dict):
for key in ("instructions", "command", "keystrokes", "cmd", "motion"):
if key in o:
val = o[key]
if isinstance(val, str):
cmds.append(val.strip())
elif isinstance(val, record):
for merchandise in val:
if isinstance(merchandise, str):
cmds.append(merchandise.strip())
elif isinstance(merchandise, dict):
stroll(merchandise)
for v in o.values():
if isinstance(v, (dict, record)):
stroll(v)
elif isinstance(o, record):
for v in o:
stroll(v)
stroll(obj)
return [c for c in cmds if c]
We outline a command-extraction utility that reads assistant responses and makes an attempt to parse shell instructions from JSON-style outputs. We clear doable code fences, load the content material as JSON, and recursively search by way of widespread command-related fields. This helps us determine tool-like actions inside agent trajectories and measure how typically brokers concern executable instructions.
def render_trace(row, max_chars=600):
meta = {ok: row.get(ok) for ok in
("original_source", "original_teacher", "mannequin", "process",
"consequence", "reward", "model_provider") if ok in row}
print("=" * 78)
print("
METADATA:", {ok: v for ok, v in meta.gadgets() if v shouldn't be None})
print("=" * 78)
for i, (function, content material) in enumerate(normalize_turns(row[TRACE_KEY])):
tag = {"system": "
SYSTEM", "person": "
USER",
"assistant": "
ASSISTANT", "device": "
TOOL"}.get(function, f"
{function.higher()}")
snippet = content material if len(content material) <= max_chars else content material[:max_chars] + " …[truncated]"
print(f"n[{i}] {tag}")
print(textwrap.indent(snippet, " "))
if function == "assistant":
for c in extract_commands(content material)[:5]:
print(f" └─
parsed command: {c!r}")
print("=" * 78, "n")
print("n
EXAMPLE TRAJECTORY (first row):")
render_trace(first, max_chars=400)
We construct a trace-rendering perform that prints the metadata and the complete dialog trajectory in a readable format. We label every flip by function, truncate lengthy messages for readability, and present parsed instructions beneath assistant messages.
N = 2000
data = []
print(f"n
Streaming {N} rows for evaluation…")
for row in itertools.islice(load_dataset(REPO, break up="prepare", streaming=True), N):
turns = normalize_turns(row[TRACE_KEY])
roles = collections.Counter(r for r, _ in turns)
total_chars = sum(len(c) for _, c in turns)
asst_cmds = sum(len(extract_commands(c)) for r, c in turns if r == "assistant")
data.append({
"original_source": row.get("original_source"),
"original_teacher": row.get("original_teacher"),
"mannequin": row.get("mannequin"),
"model_provider": row.get("model_provider"),
"consequence": row.get("consequence"),
"reward": row.get("reward"),
"n_turns": len(turns),
"n_user": roles.get("person", 0),
"n_assistant": roles.get("assistant", 0),
"n_tool": roles.get("device", 0),
"total_chars": total_chars,
"n_commands": asst_cmds,
})
df = pd.DataFrame(data)
print(f"
Built DataFrame: {df.form[0]} rows × {df.form[1]} cols")
print("n
Numeric abstract (turns / size / instructions):")
print(df[["n_turns", "n_assistant", "n_tool", "total_chars", "n_commands"]]
.describe().spherical(1).to_string())
def show_dist(col, prime=15):
if col in df and df[col].notna().any():
print(f"n
Top values for '{col}':")
print(df[col].value_counts(dropna=True).head(prime).to_string())
else:
print(f"n
'{col}' is empty/absent in this pattern.")
for c in ("original_source", "original_teacher", "mannequin", "model_provider", "consequence"):
show_dist(c)
We stream a pattern of rows from AgentTrove and gather helpful statistics, together with flip counts, device utilization, whole characters, and parsed command counts. We retailer these light-weight options in a pandas DataFrame to make the dataset simpler to summarize and analyze. We additionally print distribution tables for fields akin to supply, instructor mannequin, mannequin supplier, and consequence to perceive the place the traces originate.
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
src = df["original_source"].value_counts().head(10)
axes[0, 0].barh(src.index[::-1], src.values[::-1], shade="#4C72B0")
axes[0, 0].set_title("Top 10 Task Sources"); axes[0, 0].set_xlabel("traces")
tch = df["original_teacher"].value_counts().head(10)
axes[0, 1].barh(tch.index[::-1], tch.values[::-1], shade="#55A868")
axes[0, 1].set_title("Teacher Models"); axes[0, 1].set_xlabel("traces")
axes[1, 0].hist(df["n_turns"].clip(higher=df["n_turns"].quantile(0.97)),
bins=30, shade="#C44E52", edgecolor="white")
axes[1, 0].set_title("Turns per Trajectory (97th-pct clipped)")
axes[1, 0].set_xlabel("turns"); axes[1, 0].set_ylabel("rely")
axes[1, 1].scatter(df["n_assistant"], df["n_commands"], alpha=0.3, s=12, shade="#8172B2")
axes[1, 1].set_title("Assistant Turns vs. Parsed Commands")
axes[1, 1].set_xlabel("assistant turns"); axes[1, 1].set_ylabel("shell instructions extracted")
plt.tight_layout(); plt.present()
We create 4 visualizations to discover the sampled traces from totally different angles. We plot the highest process sources, instructor fashions, turn-count distribution, and the connection between assistant turns and parsed instructions. These charts assist us shortly determine patterns in the dataset and perceive how agent conduct varies throughout sources and duties.
def is_success(row):
res = (row.get("consequence") or "").decrease()
if res in ("resolved", "success", "cross", "handed", "right"):
return True
rw = row.get("reward")
attempt:
return float(rw) >= 1.0
besides (TypeError, ValueError):
return False
out_path = "agenttrove_clean_sft.jsonl"
saved, scanned, SCAN, KEEP = 0, 0, 1500, 200
print(f"n
Scanning up to {SCAN} rows, maintaining to {KEEP} profitable traces…")
with open(out_path, "w") as f:
for row in itertools.islice(load_dataset(REPO, break up="prepare", streaming=True), SCAN):
scanned += 1
if not is_success(row):
proceed
turns = normalize_turns(row[TRACE_KEY])
conv = [{"from": r, "value": c} for r, c in turns if c.strip()]
if len(conv) < 2:
proceed
f.write(json.dumps({
"conversations": conv,
"supply": row.get("original_source"),
"instructor": row.get("original_teacher"),
}) + "n")
saved += 1
if saved >= KEEP:
break
print(f"
Scanned {scanned} rows → wrote {saved} clear traces to '{out_path}'")
def search_traces(key phrase=None, supply=None, restrict=3, scan=3000):
"""Stream the dataset and yield-print traces matching filters."""
hits = 0
for row in itertools.islice(load_dataset(REPO, break up="prepare", streaming=True), scan):
if supply and row.get("original_source") != supply:
proceed
if key phrase:
blob = " ".be part of(c for _, c in normalize_turns(row[TRACE_KEY]))
if key phrase.decrease() not in blob.decrease():
proceed
render_trace(row, max_chars=300)
hits += 1
if hits >= restrict:
break
if hits == 0:
print("No matches in the scanned window — attempt rising `scan`.")
print("n
Searching for 'nl2bash' supply traces:")
search_traces(supply="nl2bash", restrict=2, scan=4000)
print("n
Tutorial full! Next concepts:")
print(" • Increase N / SCAN for larger analyses.")
print(" • Filter by original_source (swesmith, codeforces, r2egym…) for a area SFT set.")
print(" • Feed agenttrove_clean_sft.jsonl into Axolotl / LLaMA-Factory for fine-tuning.")
We outline a success filter that retains traces marked as resolved, handed, right, or positively rewarded. We then export profitable trajectories into a clear ShareGPT-style JSONL file for downstream fine-tuning workflows. Also, we add a search utility to discover traces by key phrase or supply, making the dataset simpler to probe for particular agentic duties.
In conclusion, we constructed a full, hands-on pipeline to examine, analyze, filter, and export information from AgentTrove in a Colab-friendly approach. We began with streaming entry, then progressively added schema detection, flip normalization, command extraction, trajectory rendering, statistical evaluation, visualization, success-based filtering, and key phrase or source-based search. This workflow helps us perceive the interior construction of agentic traces and offers us a reusable basis for making ready high-quality subsets for fine-tuning or analysis. We additionally maintain the method scalable by avoiding full dataset downloads and utilizing streamed samples solely when wanted. Also, we demonstrated how AgentTrove can be utilized as greater than a static dataset: we handled it as a wealthy supply of agent conduct, device utilization, process outcomes, and training-ready conversations that may assist future experiments in agent studying, workflow evaluation, and domain-specific SFT dataset creation.
Check out the Full Codes with Notebook. Also, be happy to comply with us on Twitter and don’t neglect to be part of our 150k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.
Need to associate with us for selling your GitHub Repo OR Hugging Face Page OR Product Release OR Webinar and so on.? Connect with us
The put up How to Use AgentTrove: Streaming 1.7M Agentic Traces and Building a Clean ShareGPT SFT Dataset in Python appeared first on MarkTechPost.
