Building Supervised Fine-Tuning Data from NVIDIA Open-SWE-Traces: Trajectory Parsing, Patch Analysis, Token Budgets, and Tool-Use Metrics
In this tutorial, we explore the Open-SWE-Traces dataset as a practical resource for studying and preparing agentic software-engineering trajectories for fine-tuning. We stream the dataset directly from Hugging Face, so we can work with a large dataset efficiently in Google Colab without downloading everything locally. We inspect individual records, normalize multi-turn agent conversations, parse final code patches, extract useful metadata, and build an analysis DataFrame to understand trajectory length, tool usage, patch size, language distribution, and resolution outcomes. We then use these insights to create a curated supervised fine-tuning subset that retains only high-quality trajectories based on success labels, token limits, language filters, and patch availability.
Installing Dependencies and Configuration
import subprocess, sys
def _pip(*pkgs):
    # check=False: a failed install prints its error but does not abort the notebook.
    subprocess.run([sys.executable, "-m", "pip", "install", "-q", *pkgs], check=False)
_pip("-U", "datasets", "huggingface_hub")
_pip("tiktoken", "pandas", "matplotlib")
import json
import re
import textwrap
from itertools import islice
from collections import Counter
import pandas as pd
import matplotlib.pyplot as plt
from datasets import load_dataset
pd.set_option("display.max_columns", 50)
pd.set_option("display.width", 160)
plt.rcParams.update({
    "figure.figsize": (9, 4.6),
    "figure.dpi": 110,
    "axes.grid": True,
    "grid.alpha": 0.25,
    "axes.spines.top": False,
    "axes.spines.right": False,
    "font.size": 11,
    "axes.titlesize": 13,
    "axes.titleweight": "bold",
})
BLUE, ORANGE, GREEN, RED = "#4C72B0", "#DD8452", "#55A868", "#C44E52"
def banner(title):
    line = "=" * 78
    print(f"\n{line}\n  {title}\n{line}")
DATASET = "nvidia/Open-SWE-Traces"
AGENTS = ["openhands", "sweagent"]
MODELS = ["minimax_m25", "qwen35_122b"]
SAMPLE_ALL = True
PER_COMBO = 400
N_SINGLE = 1500
MAX_SFT_TOKENS = 32000
SFT_REQUIRE_RESOLVED = True
SFT_LANGUAGES = None
We begin by installing and importing the core libraries needed for streaming, parsing, analysis, and visualization. We configure pandas and matplotlib to ensure our tables and plots remain readable in Google Colab. We also define the dataset name, agent/model combinations, sampling size, and SFT filtering settings that control the rest of the tutorial.
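To make the sampling settings concrete, here is a minimal sketch of how the agent/model grid expands when SAMPLE_ALL is enabled. The helper name planned_combos is hypothetical and is not part of the tutorial code; the constants simply mirror the configuration above.

```python
from itertools import product

# Values copied from the configuration above.
AGENTS = ["openhands", "sweagent"]
MODELS = ["minimax_m25", "qwen35_122b"]
PER_COMBO = 400

def planned_combos(agents, models, per_combo):
    """Hypothetical helper: list (agent, model, max_rows) in streaming order."""
    return [(a, m, per_combo) for a, m in product(agents, models)]

plan = planned_combos(AGENTS, MODELS, PER_COMBO)
total_rows = sum(n for _, _, n in plan)

for agent, model, n in plan:
    print(f"{agent:<10} / {model:<12} -> up to {n} rows")
print("upper bound on rows held in memory:", total_rows)
```

With two agents and two models, the grid has four combinations, so PER_COMBO is an upper bound per combination rather than a total. Lowering it is the simplest way to shorten a Colab run.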
Defining Trajectory Parsing Helpers
def message_text(msg):
    # Return the plain text of one message, whatever shape "content" takes.
    if not isinstance(msg, dict):
        return ""
    content = msg.get("content", "")
    if content is None:
        return ""
    if isinstance(content, str):
        return content
    if isinstance(content, list):
        # Block-style content: keep the text of each block, skip empty ones.
        parts = []
        for block in content:
            if isinstance(block, dict):
                parts.append(block.get("text") or block.get("content") or "")
            elif isinstance(block, str):
                parts.append(block)
        return "\n".join(p for p in parts if p)
    return str(content)
def normalize_trajectory(traj):
    # Accept a list of messages or a JSON string; always return a list of dicts.
    if traj is None:
        return []
    if isinstance(traj, str):
        try:
            traj = json.loads(traj)
        except Exception:
            return []
    norm = []
    for msg in traj:
        if isinstance(msg, str):
            try:
                msg = json.loads(msg)
            except Exception:
                msg = {"role": "unknown", "content": msg}
        if isinstance(msg, dict):
            norm.append(msg)
    return norm
def normalize_metadata(meta):
    if isinstance(meta, str):
        try:
            return json.loads(meta)
        except Exception:
            return {}
    return meta if isinstance(meta, dict) else {}
def role_counts(trajectory):
    c = Counter()
    for msg in trajectory or []:
        if isinstance(msg, dict):
            c[msg.get("role", "unknown")] += 1
    return c
# Text heuristics for tool use: XML-style function tags, execute_* tags, and bash fences.
_FUNC_XML = re.compile(r"<function\s*=\s*([a-zA-Z0-9_\-]+)", re.IGNORECASE)
_EXEC_TAG = re.compile(r"<(execute_[a-z]+)>", re.IGNORECASE)
_BASH_FENCE = re.compile(r"```(?:bash|sh|shell)\b", re.IGNORECASE)
def extract_tool_names(trajectory):
    names = Counter()
    for msg in trajectory or []:
        if not isinstance(msg, dict):
            continue
        # Structured tool calls attached to the message.
        for call in msg.get("tool_calls") or []:
            fn = (call or {}).get("function", {}) if isinstance(call, dict) else {}
            if fn.get("name"):
                names[fn["name"]] += 1
        # Tool-result messages that carry the tool name.
        if msg.get("role") == "tool" and msg.get("name"):
            names[msg["name"]] += 1
        # Tool calls embedded in the assistant's text.
        if msg.get("role") == "assistant":
            text = message_text(msg)
            for m in _FUNC_XML.findall(text):
                names[m.lower()] += 1
            for m in _EXEC_TAG.findall(text):
                names[m.lower()] += 1
            if _BASH_FENCE.search(text):
                names["bash_block"] += 1
    return names
def parse_patch(diff_text):
    # Returns (n_files, additions, deletions, file_paths, extension_counts).
    if not diff_text or not isinstance(diff_text, str):
        return 0, 0, 0, [], Counter()
    files, exts = [], Counter()
    additions = deletions = 0
    for line in diff_text.splitlines():
        if line.startswith("diff --git"):
            parts = line.split()
            if len(parts) >= 3:
                path = parts[2][2:] if parts[2].startswith("a/") else parts[2]
                files.append(path)
                base = path.split("/")[-1]
                if "." in base:
                    exts[base.rsplit(".", 1)[-1].lower()] += 1
        elif line.startswith("+") and not line.startswith("+++"):
            additions += 1
        elif line.startswith("-") and not line.startswith("---"):
            deletions += 1
    return len(files), additions, deletions, files, exts
def make_token_counter():
    try:
        import tiktoken
        enc = tiktoken.get_encoding("cl100k_base")
        return lambda s: len(enc.encode(s, disallowed_special=()))
    except Exception:
        # Fallback when tiktoken is unavailable: roughly 4 characters per token.
        return lambda s: max(1, len(s) // 4)
count_tokens = make_token_counter()
We define helper functions that make the dataset easier to process, even when fields appear in different formats. We normalize trajectories, extract message text, count roles, detect tool usage, parse code patches, and estimate token lengths. We build these utilities defensively so that our analysis stays stable across schema variations in large streamed datasets.
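To see what the patch-parsing heuristic actually counts, the sketch below applies the same line-prefix rules to a small synthetic diff. The function name summarize_patch and the diff itself are made up for illustration and do not come from the dataset.

```python
from collections import Counter

def summarize_patch(diff_text):
    """Illustrative re-implementation of the line-prefix rules on a unified diff."""
    files, exts = [], Counter()
    additions = deletions = 0
    for line in diff_text.splitlines():
        if line.startswith("diff --git"):
            parts = line.split()
            if len(parts) >= 3:
                # parts[2] looks like "a/src/app.py"; drop the "a/" prefix.
                path = parts[2][2:] if parts[2].startswith("a/") else parts[2]
                files.append(path)
                base = path.split("/")[-1]
                if "." in base:
                    exts[base.rsplit(".", 1)[-1].lower()] += 1
        elif line.startswith("+") and not line.startswith("+++"):
            additions += 1
        elif line.startswith("-") and not line.startswith("---"):
            deletions += 1
    return {"files": files, "added": additions, "deleted": deletions, "exts": dict(exts)}

# Synthetic two-file diff (assumption: not taken from Open-SWE-Traces).
SYNTHETIC_DIFF = "\n".join([
    "diff --git a/src/app.py b/src/app.py",
    "--- a/src/app.py",
    "+++ b/src/app.py",
    "@@ -1,3 +1,4 @@",
    " import os",
    "-x = 1",
    "+x = 2",
    "+y = 3",
    "diff --git a/README.md b/README.md",
    "--- a/README.md",
    "+++ b/README.md",
    "@@ -1 +1 @@",
    "-old",
    "+new",
])

summary = summarize_patch(SYNTHETIC_DIFF)
print(summary)
```

Because the rule only looks at line prefixes, a removed line whose own content starts with `--` (for example, a SQL comment) would appear as `---` in the diff and be skipped as a header. The counts are best treated as estimates of patch churn, not exact figures.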
Streaming and Inspecting Trajectories
def stream_take(agent, model, n):
    # The agent name selects the dataset config; the model name selects the split.
    ds = load_dataset(DATASET, agent, split=model, streaming=True)
    rows = []
    for ex in islice(ds, n):
        ex = dict(ex)
        ex["_agent"], ex["_model"] = agent, model
        rows.append(ex)
    return rows
banner("STEP 1 — Streaming trajectories from the Hub")
raw_rows = []
if SAMPLE_ALL:
    combos = [(a, m) for a in AGENTS for m in MODELS]
    for agent, model in combos:
        try:
            part = stream_take(agent, model, PER_COMBO)
            raw_rows.extend(part)
            print(f"  ✓ {agent:<10} / {model:<12} -> {len(part):>4} rows")
        except Exception as e:
            print(f"  ✗ {agent}/{model} failed: {type(e).__name__}: {e}")
else:
    raw_rows = stream_take(AGENTS[0], MODELS[0], N_SINGLE)
    print(f"  ✓ {AGENTS[0]} / {MODELS[0]} -> {len(raw_rows)} rows")
print(f"\n  Total rows pulled into memory: {len(raw_rows)}")
assert raw_rows, "No rows streamed — check your internet connection and retry."
banner("STEP 2 — Anatomy of a single record")
sample = raw_rows[0]
print("Top-level fields :", list(sample.keys()))
print("instance_id :", sample.get("instance_id"))
print("repo / language :", sample.get("repo"), "/", sample.get("language"))
print("license :", sample.get("license"))
print("resolved (1/0/-1):", sample.get("resolved"))
print("metadata :", normalize_metadata(sample.get("metadata")))
traj0 = normalize_trajectory(sample.get("trajectory"))
print(f"\nTrajectory has {len(traj0)} messages. Role histogram: {dict(role_counts(traj0))}")
print("\n--- Trajectory walkthrough (each message truncated to 240 chars) ---")
for i, msg in enumerate(traj0[:8]):
    role = msg.get("role", "unknown").upper()
    body = " ".join(message_text(msg).split())  # collapse whitespace for display
    print(f"\n[{i}] {role}")
    print(textwrap.fill(body[:240] + ("…" if len(body) > 240 else ""),
                        width=92, subsequent_indent=" "))
if len(traj0) > 8:
    print(f"\n… (+{len(traj0) - 8} more messages)")
print("\n--- Final patch (model_patch), first 25 lines ---")
print("\n".join((sample.get("model_patch") or "").splitlines()[:25]) or "(empty)")
We stream a small sample of Open-SWE-Traces directly from Hugging Face instead of downloading the full dataset. We collect examples across agent and model combinations, then inspect the structure of a single record in detail. We walk through the first few trajectory messages and preview the final patch to understand what each training example contains.
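The key property of the streaming step is that only the rows we ask for are ever materialized. The sketch below illustrates this offline with a hypothetical generator, fake_stream, standing in for the streamed dataset. It assumes nothing about the real dataset beyond row-by-row iteration.

```python
from itertools import islice

def fake_stream(limit, produced):
    """Hypothetical stand-in for a streamed dataset: yields rows lazily."""
    for i in range(limit):
        produced.append(i)  # record that row i was actually generated
        yield {"instance_id": f"demo-{i}", "resolved": i % 2}

def take(stream, n, agent, model):
    """Take the first n rows and tag each with its agent/model origin."""
    rows = []
    for ex in islice(stream, n):
        ex = dict(ex)
        ex["_agent"], ex["_model"] = agent, model
        rows.append(ex)
    return rows

produced = []
rows = take(fake_stream(1_000_000, produced), 3, "openhands", "minimax_m25")
print(len(rows), "rows kept;", len(produced), "rows generated")
```

Although the stand-in stream could yield a million rows, islice stops pulling after the third one. This is why a fixed per-combination sample stays cheap regardless of the dataset's total size.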
Building the Analysis DataFrame
banner("STEP 3 — Building the analysis DataFrame")
def process_example(ex):
    # Flatten one raw record into a row of trajectory-level features.
    traj = normalize_trajectory(ex.get("trajectory"))
    rc = role_counts(traj)
    nf, add, dele, _files, _exts = parse_patch(ex.get("model_patch"))
    meta = normalize_metadata(ex.get("metadata"))
    full_text = "\n".join(message_text(m) for m in traj)
    return {
        "instance_id": ex.get("instance_id"),
        "repo": ex.get("repo"),
        "language": (ex.get("language") or "unknown").lower(),
        "license": ex.get("license"),
        "resolved": ex.get("resolved"),
        "agent": ex.get("_agent"),
        "model": ex.get("_model"),
        "n_messages": len(traj),
        "n_system": rc.get("system", 0),
        "n_user": rc.get("user", 0),
        "n_assistant": rc.get("assistant", 0),
        "n_tool": rc.get("tool", 0),
        "patch_files": nf,
        "patch_add": add,
        "patch_del": dele,
        "patch_churn": add + dele,
        "traj_tokens": count_tokens(full_text),
        "category": meta.get("category"),
        "meta_files": meta.get("num_modified_files"),
        "meta_lines": meta.get("num_modified_lines"),
        "_tools": extract_tool_names(traj),
    }
records = [process_example(ex) for ex in raw_rows]
df = pd.DataFrame(records)
df["is_resolved"] = (df["resolved"] == 1)
df["known_label"] = df["resolved"].isin([0, 1])
print(f"DataFrame: {df.shape[0]} rows x {df.shape[1]} cols")
print("\nNumeric summary:")
print(df[["n_messages", "n_assistant", "n_tool",
          "patch_files", "patch_churn", "traj_tokens"]].describe().round(1))
We transform the raw streamed records into a structured pandas DataFrame for analysis. We extract trajectory-level features such as message counts, role counts, patch churn, token estimates, metadata fields, and tool-use counters. We also create resolution flags to compare successful and unsuccessful software-engineering trajectories.
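The two resolution flags serve different purposes, and a small standard-library sketch may make the difference clearer. It assumes that a resolved value of -1 marks an outcome that is not known. The tutorial prints the field as "1/0/-1" but does not spell out the meaning, so we treat this reading as an assumption. The rows are synthetic.

```python
def label_flags(resolved):
    """Mirror the two DataFrame flags for a single 'resolved' value."""
    return {
        "is_resolved": resolved == 1,        # success label
        "known_label": resolved in (0, 1),   # assumption: -1 means "outcome unknown"
    }

# Synthetic labels: two successes, one failure, one unknown.
rows = [{"resolved": 1}, {"resolved": 0}, {"resolved": -1}, {"resolved": 1}]
flags = [label_flags(r["resolved"]) for r in rows]

known = [f for f in flags if f["known_label"]]
rate_known = sum(f["is_resolved"] for f in known) / len(known)
rate_all = sum(f["is_resolved"] for f in flags) / len(flags)

print(f"rate over known labels: {rate_known:.3f}")
print(f"rate over all rows:     {rate_all:.3f}")
```

Computing the rate over all rows silently counts unknown outcomes as failures. Restricting to rows with a known label avoids that bias, which is presumably why the later comparisons filter on known_label first.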
Visualizing Trajectory Distributions
banner("STEP 4 — Distributions & visualizations")
lang_counts = df["language"].value_counts()
print("Trajectories per language:\n", lang_counts.to_string())
ax = lang_counts.plot(kind="bar", color=BLUE)
ax.set_title("Trajectories per language (sample)")
ax.set_xlabel(""); ax.set_ylabel("count")
plt.tight_layout(); plt.show()
# Resolution rates use only rows whose label is 0 or 1.
known = df[df["known_label"]]
by_lang = (known.groupby("language")["is_resolved"]
           .agg(rate="mean", n="size")
           .query("n >= 25")
           .sort_values("rate", ascending=False))
print("\nResolution rate by language (n>=25):\n", by_lang.round(3).to_string())
if not by_lang.empty:
    ax = by_lang["rate"].plot(kind="bar", color=GREEN)
    ax.set_title("Resolution rate by language")
    ax.set_xlabel(""); ax.set_ylabel("fraction resolved"); ax.set_ylim(0, 1)
    plt.tight_layout(); plt.show()
if known["agent"].nunique() > 1 or known["model"].nunique() > 1:
    pivot = (known.groupby(["agent", "model"])["is_resolved"].mean().unstack())
    print("\nResolution rate by scaffold x model:\n", pivot.round(3).to_string())
    ax = pivot.plot(kind="bar", color=[BLUE, ORANGE])
    ax.set_title("Resolution rate: scaffold x model")
    ax.set_xlabel("agent"); ax.set_ylabel("fraction resolved"); ax.set_ylim(0, 1)
    ax.legend(title="model"); plt.tight_layout(); plt.show()
ax = df["n_messages"].plot(kind="hist", bins=40, color=BLUE, alpha=0.85)
ax.set_title("Messages per trajectory")
ax.set_xlabel("number of messages"); ax.set_ylabel("trajectories")
plt.tight_layout(); plt.show()
# Clip the long tail so a few huge patches do not flatten the histogram.
churn = df["patch_churn"].clip(upper=df["patch_churn"].quantile(0.97))
ax = churn.plot(kind="hist", bins=40, color=ORANGE, alpha=0.85)
ax.set_title("Patch size — lines changed (clipped at p97)")
ax.set_xlabel("added + deleted lines"); ax.set_ylabel("trajectories")
plt.tight_layout(); plt.show()
if known["is_resolved"].nunique() > 1:
    fig, ax = plt.subplots()
    for flag, color, lab in [(True, GREEN, "resolved"), (False, RED, "unresolved")]:
        sub = known[known["is_resolved"] == flag]
        ax.scatter(sub["n_messages"], sub["traj_tokens"],
                   s=10, alpha=0.4, color=color, label=lab)
    ax.set_title("Trajectory length vs. token size, by outcome")
    ax.set_xlabel("messages"); ax.set_ylabel("estimated tokens")
    ax.legend(); plt.tight_layout(); plt.show()
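The per-language comparison above drops groups with fewer than 25 labeled rows before ranking them. The following standard-library sketch shows the same idea on synthetic data with a smaller threshold. The helper name rate_by_group is hypothetical.

```python
from collections import defaultdict

def rate_by_group(pairs, min_n):
    """Resolution rate per group, keeping only groups with at least min_n rows."""
    buckets = defaultdict(list)
    for group, resolved in pairs:
        buckets[group].append(resolved == 1)
    kept = {g: (sum(v) / len(v), len(v)) for g, v in buckets.items() if len(v) >= min_n}
    # Sort by rate, highest first, to match the ranking in the table above.
    return dict(sorted(kept.items(), key=lambda kv: kv[1][0], reverse=True))

# Synthetic (language, resolved) pairs; "go" has a single row.
pairs = (
    [("python", 1)] * 3 + [("python", 0)]
    + [("rust", 1)] * 2
    + [("go", 1)]
)

result = rate_by_group(pairs, min_n=2)
for lang, (rate, n) in result.items():
    print(f"{lang:<8} rate={rate:.2f} n={n}")
```

Without the threshold, the single "go" row would rank at 100% alongside "rust". The minimum-count filter keeps such tiny groups from dominating the chart, at the cost of hiding rare languages in a small sample.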
Analyzing Token Budget Requirements
banner("STEP 5 — Token budget (what context window do you need?)")
tok = df["traj_tokens"]
print("Estimated tokens per trajectory — percentiles:")
for p in [50, 75, 90, 95, 99]:
    print(f"  p{p:<2}: {int(tok.quantile(p/100)):>8,}")
print(f"  max: {int(tok.max()):>8,}")
windows = [8_192, 16_384, 32_768, 65_536, 131_072]
print("\nFraction of trajectories that fit in a given context window:")
for w in windows:
    frac = (tok <= w).mean()
    print(f"  {w:>7,} tokens : {frac*100:5.1f}%")
ax = tok.clip(upper=tok.quantile(0.99)).plot(kind="hist", bins=50,
                                             color=BLUE, alpha=0.85)
# Mark common context sizes that fall inside the plotted range.
for w, c in zip([8_192, 32_768, 131_072], [GREEN, ORANGE, RED]):
    if w <= tok.quantile(0.99):
        ax.axvline(w, color=c, ls="--", lw=1.5, label=f"{w//1024}k ctx")
ax.set_title("Trajectory token-length distribution (clipped at p99)")
ax.set_xlabel("estimated tokens"); ax.set_ylabel("trajectories")
ax.legend(); plt.tight_layout(); plt.show()
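The context-window check in this step reduces to a simple fraction, which we can sketch without pandas. The token counts below are synthetic, and approx_tokens reproduces only the character-based fallback used when tiktoken is unavailable.

```python
def approx_tokens(text):
    """Fallback estimate from the helpers above: roughly 4 characters per token."""
    return max(1, len(text) // 4)

def fraction_fitting(token_counts, window):
    """Share of trajectories whose estimated length fits inside the window."""
    return sum(t <= window for t in token_counts) / len(token_counts)

# Synthetic per-trajectory token estimates (not measured from the dataset).
token_counts = [4_000, 9_000, 20_000, 40_000, 150_000]
windows = [8_192, 16_384, 32_768, 65_536, 131_072]

fits = {w: fraction_fitting(token_counts, w) for w in windows}
for w, frac in fits.items():
    print(f"{w:>7,} tokens : {frac * 100:5.1f}%")

print("fallback estimate for a 40-character string:", approx_tokens("a" * 40))
```

Both the tiktoken count and the fallback are estimates relative to the tokenizer of the model we eventually fine-tune, and the count covers message text only, so it seems prudent to leave some headroom below the nominal context limit when choosing MAX_SFT_TOKENS.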
Measuring Agent Tool Usage
banner("STEP 6 — Which tools/actions do the agents use?")
tool_totals = Counter()
for t in df["_tools"]:
    tool_totals.update(t)
top_tools = tool_totals.most_common(12)
if top_tools:
    print("Most frequent agent actions (across the sample):")
    for name, cnt in top_tools:
        print(f"  {name:<24} {cnt:>7,}")
    labels, vals = zip(*top_tools)
    fig, ax = plt.subplots(figsize=(9, 5))
    ax.barh(range(len(labels)), vals, color=BLUE)
    ax.set_yticks(range(len(labels))); ax.set_yticklabels(labels)
    ax.invert_yaxis()  # most frequent action at the top
    ax.set_title("Top agent actions / tool invocations")
    ax.set_xlabel("count"); plt.tight_layout(); plt.show()
else:
    print("No tool actions detected with the current heuristics.")
if known["is_resolved"].nunique() > 1:
    print("\nMean 'tool' (environment) turns by outcome:")
    print(known.groupby("is_resolved")["n_tool"].mean().round(2).to_string())
We explore the dataset through language counts, resolution rates, scaffold/model comparisons, message-length distributions, patch-size distributions, and token-budget analysis. We visualize how trajectory length, token size, and tool usage vary across the sampled data. We use these plots and summaries to determine which examples are practical to fine-tune under different context-window limits.
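The tool-usage counts depend on text heuristics, so it helps to see what they match. The sketch below runs the same three patterns on a synthetic assistant message. The tag formats in the sample are illustrative assumptions, not confirmed excerpts from the dataset, and the helper name tools_in_text is hypothetical.

```python
import re
from collections import Counter

FENCE = "`" * 3  # built programmatically so the sample text can contain a code fence

FUNC_XML = re.compile(r"<function\s*=\s*([a-zA-Z0-9_\-]+)", re.IGNORECASE)
EXEC_TAG = re.compile(r"<(execute_[a-z]+)>", re.IGNORECASE)
BASH_FENCE = re.compile(re.escape(FENCE) + r"(?:bash|sh|shell)\b", re.IGNORECASE)

def tools_in_text(text):
    """Count tool-like actions mentioned in one assistant message."""
    names = Counter()
    for m in FUNC_XML.findall(text):
        names[m.lower()] += 1
    for m in EXEC_TAG.findall(text):
        names[m.lower()] += 1
    if BASH_FENCE.search(text):
        names["bash_block"] += 1  # counted once per message, not per fence
    return names

# Synthetic assistant message mixing the three styles.
assistant_text = "\n".join([
    "<function=str_replace_editor>",
    "<parameter=command>view</parameter>",
    "</function>",
    "Then <execute_bash>ls</execute_bash>",
    FENCE + "bash",
    "pytest -q",
    FENCE,
])

found = tools_in_text(assistant_text)
print(dict(found))
```

Closing tags such as `</function>` are not counted because the patterns require the tag name to follow the opening bracket directly. Agents that encode actions in another format would be missed entirely, so a low count may reflect the heuristics and not the agent's behavior.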
Building a Curated SFT Subset
banner("STEP 7 — Building a curated SFT subset")
def to_chatml(trajectory):
    out = []
    for m in trajectory:
        role = m.get("role", "unknown")
        out.append(f"<|im_start|>{role}\n{message_text(m).strip()}<|im_end|>")
    return "\n".join(out)
def passes_filters(rec, raw):
    # Keep only resolved, in-budget, in-language trajectories with a non-empty patch.
    if SFT_REQUIRE_RESOLVED and rec["resolved"] != 1:
        return False
    if rec["traj_tokens"] > MAX_SFT_TOKENS:
        return False
    if SFT_LANGUAGES is not None and rec["language"] not in SFT_LANGUAGES:
        return False
    if not (raw.get("model_patch") or "").strip():
        return False
    return True
sft_examples = []
for rec, raw in zip(records, raw_rows):
    if not passes_filters(rec, raw):
        continue
    messages = [{"role": m.get("role"), "content": message_text(m)}
                for m in normalize_trajectory(raw.get("trajectory"))]
    sft_examples.append({
        "instance_id": rec["instance_id"],
        "repo": rec["repo"],
        "language": rec["language"],
        "agent": rec["agent"],
        "model": rec["model"],
        "messages": messages,
        "text": to_chatml(messages),
        "model_patch": raw.get("model_patch"),
        "approx_tokens": rec["traj_tokens"],
    })
print(f"Kept {len(sft_examples)} / {len(records)} trajectories after filtering")
print(f"  filters -> resolved_only={SFT_REQUIRE_RESOLVED}, "
      f"max_tokens={MAX_SFT_TOKENS:,}, languages={SFT_LANGUAGES or 'all'}")
if sft_examples:
    kept = pd.DataFrame(sft_examples)
    print("\nCurated subset by language:\n", kept["language"].value_counts().to_string())
    print("\n--- One formatted SFT example (ChatML, truncated) ---")
    print(sft_examples[0]["text"][:600], "…")
banner("STEP 8 — Exporting artifacts")
csv_path = "open_swe_traces_analysis.csv"
df.drop(columns=["_tools"]).to_csv(csv_path, index=False)
print(f"  Wrote analysis table -> {csv_path} ({len(df)} rows)")
jsonl_path = "open_swe_sft.jsonl"
with open(jsonl_path, "w", encoding="utf-8") as f:
    for ex in sft_examples:
        # One JSON object per line; ensure_ascii=False keeps non-ASCII text readable.
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")
print(f"  Wrote SFT dataset -> {jsonl_path} ({len(sft_examples)} rows)")
print("\nDone. In Colab, open the Files pane (folder icon, left) to download both.")
print("To load the SFT file later: datasets.load_dataset('json', "
      "data_files='open_swe_sft.jsonl')")
We convert selected trajectories into an SFT-ready format using standardized message dictionaries and an optional ChatML-style text representation. We filter examples by resolution status, token budget, language selection, and patch availability to keep the curated subset useful for training. We finally export both the analysis CSV and the JSONL SFT dataset for reuse in later fine-tuning workflows.
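As a quick check of the export format, the sketch below renders a tiny synthetic conversation in the same ChatML-style layout and round-trips it through JSONL in memory. The messages and the instance_id are made up, and an in-memory buffer stands in for the output file.

```python
import io
import json

def to_chatml(messages):
    """Render messages in the ChatML-style layout used for the 'text' field."""
    out = []
    for m in messages:
        role = m.get("role", "unknown")
        out.append(f"<|im_start|>{role}\n{m.get('content', '').strip()}<|im_end|>")
    return "\n".join(out)

# Synthetic three-turn conversation (not from the dataset).
messages = [
    {"role": "system", "content": "You are a coding agent."},
    {"role": "user", "content": "Fix the failing test."},
    {"role": "assistant", "content": "Done. "},
]
example = {"instance_id": "demo-0", "messages": messages, "text": to_chatml(messages)}

# Write one JSON object per line, then read it back.
buffer = io.StringIO()
buffer.write(json.dumps(example, ensure_ascii=False) + "\n")
buffer.seek(0)
loaded = [json.loads(line) for line in buffer if line.strip()]

print(example["text"])
print("round trip ok:", loaded == [example])
```

Newlines inside the text field are escaped by json.dumps, so each example still occupies exactly one line of the file. When training with a specific model, the messages list is usually the safer input, since the model's own chat template may differ from this ChatML-style rendering.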
Conclusion
In conclusion, we built a complete workflow to transform Open-SWE-Traces from a large, raw, agentic dataset into structured analytics and SFT-ready training data. We learned how to stream trajectories, inspect agent behavior, measure token budgets, compare scaffolds and models, analyze patch characteristics, and export both an analysis table and a JSONL training file. We now have a reusable framework that we can extend for larger sampling, language-specific fine-tuning, deeper tool-use analysis, and model-specific chat-template formatting.
