
A Coding Implementation for Parsing, Analyzing, Visualizing, and Fine-Tuning Agent Reasoning Traces Using the lambda/hermes-agent-reasoning-traces Dataset


In this tutorial, we explore the lambda/hermes-agent-reasoning-traces dataset to understand how agent-based models think, use tools, and generate responses across multi-turn conversations. We start by loading and inspecting the dataset, examining its structure, categories, and conversational format to get a clear picture of the available data. We then build simple parsers to extract key components such as reasoning traces, tool calls, and tool responses, allowing us to separate internal thinking from external actions. Next, we analyze patterns such as tool usage frequency, conversation length, and error rates to better understand agent behavior. We also create visualizations to highlight these trends and make the analysis more intuitive. Finally, we prepare the dataset for training by converting it into a model-friendly format, making it suitable for tasks like supervised fine-tuning.

!pip -q install -U datasets pandas matplotlib seaborn transformers accelerate trl


import json, re, random, textwrap
from collections import Counter, defaultdict
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datasets import load_dataset, concatenate_datasets


random.seed(0)


CONFIG = "kimi"
ds = load_dataset("lambda/hermes-agent-reasoning-traces", CONFIG, split="train")
print(ds)
print("Config:", CONFIG, "| Fields:", ds.column_names)
print("Categories:", sorted(set(ds["category"])))


COMPARE_BOTH = False
if COMPARE_BOTH:
   ds_kimi = load_dataset("lambda/hermes-agent-reasoning-traces", "kimi", split="train")
   ds_glm  = load_dataset("lambda/hermes-agent-reasoning-traces", "glm-5.1", split="train")
   ds_kimi = ds_kimi.add_column("source", ["kimi"] * len(ds_kimi))
   ds_glm  = ds_glm.add_column("source", ["glm-5.1"] * len(ds_glm))
   ds = concatenate_datasets([ds_kimi, ds_glm]).shuffle(seed=0)
   print("Combined:", ds, "→ counts:", Counter(ds["source"]))


sample = ds[0]
print("\n=== Sample 0 ===")
print("id        :", sample["id"])
print("category  :", sample["category"], "/", sample["subcategory"])
print("task      :", sample["task"])
print("turns     :", len(sample["conversations"]))
print("system[0] :", sample["conversations"][0]["value"][:220], "...\n")

We install all required libraries and import the necessary modules to set up the environment. We then load the lambda/hermes-agent-reasoning-traces dataset and inspect its structure, fields, and categories. We also optionally combine multiple dataset configurations and examine a sample to understand the conversational format.
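Before moving on, it can help to narrow the dataset to a single category for focused inspection. The snippet below is a small optional sketch using the standard datasets filter API; the category picked here is simply the first one alphabetically, so swap in whichever category interests you.

# Optional: slice the dataset down to one category for a quick look.
first_cat = sorted(set(ds["category"]))[0]          # first category alphabetically
cat_subset = ds.filter(lambda ex: ex["category"] == first_cat)
print(f"'{first_cat}' subset: {len(cat_subset)} of {len(ds)} examples")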

THINK_RE     = re.compile(r"<think>(.*?)</think>", re.DOTALL)
TOOL_CALL_RE = re.compile(r"<tool_call>\s*({.*?})\s*</tool_call>", re.DOTALL)
TOOL_RESP_RE = re.compile(r"<tool_response>\s*(.*?)\s*</tool_response>", re.DOTALL)


def parse_assistant(value: str) -> dict:
   thoughts = [t.strip() for t in THINK_RE.findall(value)]
   calls = []
   for raw in TOOL_CALL_RE.findall(value):
       try:
           calls.append(json.loads(raw))
       except json.JSONDecodeError:
           calls.append({"name": "<malformed>", "arguments": {}})
   final = TOOL_CALL_RE.sub("", THINK_RE.sub("", value)).strip()
   return {"thoughts": thoughts, "tool_calls": calls, "final": final}


def parse_tool(value: str):
   raw = TOOL_RESP_RE.search(value)
   if not raw: return {"raw": value}
   body = raw.group(1)
   try:    return json.loads(body)
   except: return {"raw": body}


first_gpt = next(t for t in sample["conversations"] if t["from"] == "gpt")
p = parse_assistant(first_gpt["value"])
print("Thought preview :", (p["thoughts"][0][:160] + "...") if p["thoughts"] else "(none)")
print("Tool calls       :", [(c.get("name"), list(c.get("arguments", {}).keys())) for c in p["tool_calls"]])

We define regex-based parsers to extract reasoning traces, tool calls, and tool responses from the dataset. We process assistant messages to separate thoughts, actions, and final outputs in a structured way. We then test the parser on a sample conversation to verify that the extraction works correctly.
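As an extra self-contained sanity check, we can run the parser on a hand-written assistant message. The example below is our own; the tool name get_weather and its arguments are purely illustrative, but the tag format mirrors what we observed in the dataset.

# Minimal parser self-check on a synthetic message (tool name is made up).
demo = (
    "<think>I should look up the weather first.</think>\n"
    '<tool_call>{"name": "get_weather", "arguments": {"city": "Paris"}}</tool_call>\n'
    "It is sunny in Paris."
)
parsed = parse_assistant(demo)
assert parsed["thoughts"] == ["I should look up the weather first."]
assert parsed["tool_calls"][0]["name"] == "get_weather"
assert parsed["final"] == "It is sunny in Paris."
print("Parser self-check passed.")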

N = 3000
sub = ds.select(range(min(N, len(ds))))


tool_calls         = Counter()
parallel_widths    = Counter()
thoughts_per_turn  = []
calls_per_traj     = []
errors_per_traj    = []
turns_per_traj     = []
cat_counts         = Counter()


for ex in sub:
   cat_counts[ex["category"]] += 1
   n_calls = n_err = 0
   turns_per_traj.append(len(ex["conversations"]))
   for t in ex["conversations"]:
       if t["from"] == "gpt":
           p = parse_assistant(t["value"])
           thoughts_per_turn.append(len(p["thoughts"]))
           if p["tool_calls"]:
               parallel_widths[len(p["tool_calls"])] += 1
               for c in p["tool_calls"]:
                   tool_calls[c.get("name", "<unknown>")] += 1
               n_calls += len(p["tool_calls"])
       elif t["from"] == "tool":
           r = parse_tool(t["value"])
           blob = json.dumps(r).lower()
           if "error" in blob or '"exit_code": 1' in blob or "traceback" in blob:
               n_err += 1
   calls_per_traj.append(n_calls)
   errors_per_traj.append(n_err)


print(f"\nScanned {len(sub)} trajectories")
print(f"Avg turns/traj      : {np.mean(turns_per_traj):.1f}")
print(f"Avg tool calls/traj : {np.mean(calls_per_traj):.1f}")
print(f"% with >=1 error    : {100*np.mean([e>0 for e in errors_per_traj]):.1f}%")
print(f"% parallel turns    : {100*sum(v for k,v in parallel_widths.items() if k>1)/max(1,sum(parallel_widths.values())):.1f}%")
print("Top 10 tools        :", tool_calls.most_common(10))


fig, axes = plt.subplots(2, 2, figsize=(13, 9))


top = tool_calls.most_common(15)
axes[0,0].barh([t for t,_ in top][::-1], [c for _,c in top][::-1], color="teal")
axes[0,0].set_title("Top 15 tools by call volume")
axes[0,0].set_xlabel("calls")


ks = sorted(parallel_widths)
axes[0,1].bar([str(k) for k in ks], [parallel_widths[k] for k in ks], color="coral")
axes[0,1].set_title("Tool-calls per assistant turn (parallel width)")
axes[0,1].set_xlabel("# tool calls in a single turn"); axes[0,1].set_ylabel("count")
axes[0,1].set_yscale("log")


axes[1,0].hist(turns_per_traj, bins=40, color="steelblue")
axes[1,0].set_title("Conversation length"); axes[1,0].set_xlabel("turns")


cats, vals = zip(*cat_counts.most_common())
axes[1,1].pie(vals, labels=cats, autopct="%1.0f%%", startangle=90)
axes[1,1].set_title("Category distribution")


plt.tight_layout(); plt.show()

We perform dataset-wide analytics to measure tool usage, conversation lengths, and error patterns. We aggregate statistics across multiple samples to understand overall agent behavior. We also create visualizations to highlight trends such as tool frequency, parallel calls, and category distribution.
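If you want these statistics broken down by category rather than in aggregate, a small pandas pass over the same subset works well. The sketch below reuses our parser; the column names are our own choices, not dataset fields.

# Per-category averages via pandas (a minimal sketch reusing parse_assistant).
rows = []
for ex in sub:
    n_calls = sum(
        len(parse_assistant(t["value"])["tool_calls"])
        for t in ex["conversations"] if t["from"] == "gpt"
    )
    rows.append({"category": ex["category"],
                 "turns": len(ex["conversations"]),
                 "tool_calls": n_calls})
df = pd.DataFrame(rows)
print(df.groupby("category")[["turns", "tool_calls"]].mean().round(1))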

def render_trace(ex, max_chars=350):
   print(f"\n{'='*72}\nTASK [{ex['category']} / {ex['subcategory']}]: {ex['task']}\n{'='*72}")
   for t in ex["conversations"]:
       role = t["from"]
       if role == "system":
           continue
       if role == "human":
           print(f"\n[USER]\n{textwrap.shorten(t['value'], 600)}")
       elif role == "gpt":
           p = parse_assistant(t["value"])
           for th in p["thoughts"]:
               print(f"\n[THINK]\n{textwrap.shorten(th, max_chars)}")
           for c in p["tool_calls"]:
               args = json.dumps(c.get("arguments", {}))[:200]
               print(f"[CALL] {c.get('name')}({args})")
           if p["final"]:
               print(f"\n[ANSWER]\n{textwrap.shorten(p['final'], max_chars)}")
       elif role == "tool":
           print(f"[TOOL_RESPONSE] {textwrap.shorten(t['value'], 220)}")
   print("="*72)


idx = int(np.argmin(np.abs(np.array(turns_per_traj) - 10)))
render_trace(sub[idx])


def get_tool_schemas(ex):
   try:    return json.loads(ex["tools"])
   except: return []


schemas = get_tool_schemas(sample)
print(f"\nSample 0 has {len(schemas)} tools available")
for s in schemas[:3]:
   fn = s.get("function", {})
   print(" -", fn.get("name"), "—", (fn.get("description") or "")[:80])


ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant", "tool": "tool"}


def to_openai_messages(conv):
   return [{"role": ROLE_MAP[t["from"]], "content": t["value"]} for t in conv]


example_msgs = to_openai_messages(sample["conversations"])
print("\nFirst 2 OpenAI messages:")
for m in example_msgs[:2]:
   print(" ", m["role"], "→", m["content"][:120].replace("\n", " "), "...")

We build utilities to render full conversation traces in a readable format for deeper inspection. We also extract tool schemas and convert the dataset into OpenAI-style message format for compatibility with training pipelines. This helps us better understand both the structure of tools and how conversations can be standardized.
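Once conversations are in OpenAI-style message format, exporting them to JSONL makes them easy to feed into most training or evaluation pipelines. The snippet below is a minimal sketch; the filename and the 50-example slice are arbitrary choices.

# Export a small slice of converted conversations to JSONL (one dict per line).
with open("hermes_traces_openai.jsonl", "w") as f:
    for ex in sub.select(range(min(50, len(sub)))):
        f.write(json.dumps({"messages": to_openai_messages(ex["conversations"])}) + "\n")
print("Wrote JSONL export: hermes_traces_openai.jsonl")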

from transformers import AutoTokenizer
TOK_ID = "Qwen/Qwen2.5-0.5B-Instruct"
tok = AutoTokenizer.from_pretrained(TOK_ID)


def build_masked(conv, tokenizer, max_len=2048):
   msgs = to_openai_messages(conv)
   for m in msgs:
       if m["role"] == "tool":
           m["role"] = "user"
           m["content"] = "[TOOL OUTPUT]\n" + m["content"]
   input_ids, labels = [], []
   for m in msgs:
       text = tokenizer.apply_chat_template([m], tokenize=False, add_generation_prompt=False)
       ids = tokenizer.encode(text, add_special_tokens=False)
       input_ids.extend(ids)
       labels.extend(ids if m["role"] == "assistant" else [-100] * len(ids))
   return input_ids[:max_len], labels[:max_len]


ids, lbls = build_masked(sample["conversations"], tok)
trainable = sum(1 for x in lbls if x != -100)
print(f"\nTokenized example: {len(ids)} tokens, {trainable} trainable ({100*trainable/len(ids):.1f}%)")


think_lens, call_lens, ans_lens = [], [], []
for ex in sub.select(range(min(500, len(sub)))):
   for t in ex["conversations"]:
       if t["from"] != "gpt": continue
       p = parse_assistant(t["value"])
       for th in p["thoughts"]: think_lens.append(len(th))
       for c in p["tool_calls"]: call_lens.append(len(json.dumps(c)))
       if p["final"]: ans_lens.append(len(p["final"]))


plt.figure(figsize=(10,4))
plt.hist([think_lens, call_lens, ans_lens], bins=40, log=True,
        label=["<think>", "<tool_call>", "final answer"], stacked=False)
plt.legend(); plt.xlabel("characters"); plt.title("Length distributions (log y)")
plt.tight_layout(); plt.show()


class TraceReplayer:
   def __init__(self, ex):
       self.ex = ex
       self.steps = []
       pending = None
       for t in ex["conversations"]:
           if t["from"] == "gpt":
               if pending: self.steps.append(pending)
               pending = {"think": parse_assistant(t["value"]), "responses": []}
           elif t["from"] == "tool" and pending:
               pending["responses"].append(parse_tool(t["value"]))
       if pending: self.steps.append(pending)
   def __len__(self): return len(self.steps)
   def play(self, i):
       s = self.steps[i]
       print(f"\n── Step {i+1}/{len(self)} ──")
       for th in s["think"]["thoughts"]:
           print(f"💭 {textwrap.shorten(th, 280)}")
       for c in s["think"]["tool_calls"]:
           print(f"⚙  {c.get('name')}({json.dumps(c.get('arguments', {}))[:140]})")
       for r in s["responses"]:
           print(f"📥 {textwrap.shorten(json.dumps(r), 200)}")
       if s["think"]["final"]:
           print(f"💬 {textwrap.shorten(s['think']['final'], 200)}")


rp = TraceReplayer(sample)
for i in range(min(3, len(rp))):
   rp.play(i)


TRAIN = False
if TRAIN:
   import torch
   from transformers import AutoModelForCausalLM
   from trl import SFTTrainer, SFTConfig


   train_subset = ds.select(range(200))


   def to_text(batch):
       msgs = to_openai_messages(batch["conversations"])
       for m in msgs:
           if m["role"] == "tool":
               m["role"] = "user"; m["content"] = "[TOOL]\n" + m["content"]
       batch["text"] = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=False)
       return batch


   train_subset = train_subset.map(to_text)


   model = AutoModelForCausalLM.from_pretrained(
       TOK_ID,
       torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
       device_map="auto" if torch.cuda.is_available() else None,
   )


   cfg = SFTConfig(
       output_dir="hermes-sft-demo",
       per_device_train_batch_size=1,
       gradient_accumulation_steps=4,
       max_steps=20,
       learning_rate=2e-5,
       logging_steps=2,
       max_seq_length=1024,
       dataset_text_field="text",
       report_to="none",
       fp16=torch.cuda.is_available(),
   )
   SFTTrainer(model=model, args=cfg, train_dataset=train_subset, processing_class=tok).train()
   print("Fine-tune demo completed.")


print("\n✅ Tutorial complete. You now have parsers, analytics, plots, a replayer, "
     "tokenized + label-masked SFT examples, and an optional training hook.")

We tokenize the conversations and apply label masking so only assistant responses contribute to training. We analyze the length distributions of reasoning, tool calls, and answers to gain further insight. We also implement a trace replayer to step through agent behavior and optionally run a small fine-tuning loop.
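To batch the masked examples for an actual training loop, the label-masked sequences still need padding to a common length. The collator below is our own minimal sketch built on build_masked; it assumes the tokenizer exposes a pad_token_id (falling back to eos_token_id otherwise).

# A minimal padding collator sketch for build_masked outputs (our own helper).
def collate(convs, tokenizer, max_len=2048):
    pad_id = tokenizer.pad_token_id or tokenizer.eos_token_id
    encoded = [build_masked(c, tokenizer, max_len) for c in convs]
    longest = max(len(ids) for ids, _ in encoded)
    batch = {"input_ids": [], "labels": [], "attention_mask": []}
    for ids, lbls in encoded:
        pad = longest - len(ids)
        batch["input_ids"].append(ids + [pad_id] * pad)
        batch["labels"].append(lbls + [-100] * pad)   # padding never contributes to loss
        batch["attention_mask"].append([1] * len(ids) + [0] * pad)
    return batch

demo_batch = collate([ds[i]["conversations"] for i in range(2)], tok)
print("Batch:", len(demo_batch["input_ids"]), "x", len(demo_batch["input_ids"][0]))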

In conclusion, we developed a structured workflow to parse, analyze, and work effectively with agent reasoning traces. We were able to break down conversations into meaningful components, examine how agents reason step by step, and measure how they interact with tools during problem solving. Using the visualizations and analytics, we gained insights into common patterns and behaviors across the dataset. In addition, we converted the data into a format suitable for training language models, including handling tokenization and label masking for assistant responses. This process provides a strong foundation for studying, evaluating, and improving tool-using AI systems in a practical, scalable way.

