Parsing, Analyzing, Visualizing, and Debugging Agent Reasoning Traces with the lambda/hermes-agent-reasoning-traces Dataset

In this lesson, we explore the lambda/hermes-agent-reasoning-traces dataset to understand how agentic models think, use tools, and generate responses across multi-turn conversations. We start by loading the dataset and examining its structure, categories, and dialog format to get a clear view of the available information. We then develop simple parsers to extract key components such as reasoning traces, tool calls, and tool responses, allowing us to separate internal reasoning from external actions. Next, we analyze patterns such as tool-usage frequency, conversation length, and error rates to better understand agent behavior, and we create visualizations that highlight these trends and simplify analysis. Finally, we convert the dataset into a model-friendly format, making it suitable for tasks such as supervised fine-tuning.

!pip -q install -U datasets pandas matplotlib seaborn transformers accelerate trl


import json, re, random, textwrap
from collections import Counter, defaultdict
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datasets import load_dataset, concatenate_datasets


random.seed(0)


CONFIG = "kimi"
ds = load_dataset("lambda/hermes-agent-reasoning-traces", CONFIG, split="train")
print(ds)
print("Config:", CONFIG, "| Fields:", ds.column_names)
print("Categories:", sorted(set(ds["category"])))


COMPARE_BOTH = False
if COMPARE_BOTH:
   ds_kimi = load_dataset("lambda/hermes-agent-reasoning-traces", "kimi", split="train")
   ds_glm  = load_dataset("lambda/hermes-agent-reasoning-traces", "glm-5.1", split="train")
   ds_kimi = ds_kimi.add_column("source", ["kimi"] * len(ds_kimi))
   ds_glm  = ds_glm.add_column("source", ["glm-5.1"] * len(ds_glm))
   ds = concatenate_datasets([ds_kimi, ds_glm]).shuffle(seed=0)
   print("Combined:", ds, "→ counts:", Counter(ds["source"]))


sample = ds[0]
print("\n=== Sample 0 ===")
print("id        :", sample["id"])
print("category  :", sample["category"], "/", sample["subcategory"])
print("task      :", sample["task"])
print("turns     :", len(sample["conversations"]))
print("system[0] :", sample["conversations"][0]["value"][:220], "...\n")

We install the necessary libraries and import the required modules to set up our environment. We then load the lambda/hermes-agent-reasoning-traces dataset and examine its structure, fields, and categories. We also show how to combine multiple dataset configurations and inspect a sample to understand the conversation format.

THINK_RE     = re.compile(r"<think>(.*?)</think>", re.DOTALL)
TOOL_CALL_RE = re.compile(r"<tool_call>\s*({.*?})\s*</tool_call>", re.DOTALL)
TOOL_RESP_RE = re.compile(r"<tool_response>\s*(.*?)\s*</tool_response>", re.DOTALL)


def parse_assistant(value: str) -> dict:
   thoughts = [t.strip() for t in THINK_RE.findall(value)]
   calls = []
   for raw in TOOL_CALL_RE.findall(value):
       try:
           calls.append(json.loads(raw))
       except json.JSONDecodeError:
           calls.append({"name": "", "arguments": {}})
   final = TOOL_CALL_RE.sub("", THINK_RE.sub("", value)).strip()
   return {"thoughts": thoughts, "tool_calls": calls, "final": final}


def parse_tool(value: str):
   raw = TOOL_RESP_RE.search(value)
   if not raw: return {"raw": value}
   body = raw.group(1)
   try:    return json.loads(body)
   except json.JSONDecodeError: return {"raw": body}


first_gpt = next(t for t in sample["conversations"] if t["from"] == "gpt")
p = parse_assistant(first_gpt["value"])
print("Thought preview :", (p["thoughts"][0][:160] + "...") if p["thoughts"] else "(none)")
print("Tool calls       :", [(c.get("name"), list(c.get("arguments", {}).keys())) for c in p["tool_calls"]])

We define regex-based parsers to extract reasoning traces, tool calls, and tool responses from the dataset. We process the assistant’s messages to separate thoughts, actions, and final answers in a systematic way. We then test the parser on a sample dialog to make sure the extraction works properly.
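To sanity-check the extraction logic in isolation, here is a self-contained sketch of the same regex approach applied to a synthetic assistant message (the message text and tool name are invented for illustration):

```python
import json, re

# Hermes-style tags; the message below is invented for illustration
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
TOOL_CALL_RE = re.compile(r"<tool_call>\s*({.*?})\s*</tool_call>", re.DOTALL)

msg = (
    "<think>I should check the weather first.</think>"
    '<tool_call>{"name": "get_weather", "arguments": {"city": "Paris"}}</tool_call>'
    "Let me look that up."
)

thoughts = [t.strip() for t in THINK_RE.findall(msg)]          # internal reasoning
calls = [json.loads(raw) for raw in TOOL_CALL_RE.findall(msg)]  # structured actions
final = TOOL_CALL_RE.sub("", THINK_RE.sub("", msg)).strip()     # user-facing text

print(thoughts)            # ['I should check the weather first.']
print(calls[0]["name"])    # get_weather
print(final)               # Let me look that up.
```

The non-greedy `(.*?)` with `re.DOTALL` keeps each capture inside its own tag pair even when the content spans multiple lines.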

N = 3000
sub = ds.select(range(min(N, len(ds))))


tool_calls         = Counter()
parallel_widths    = Counter()
thoughts_per_turn  = []
calls_per_traj     = []
errors_per_traj    = []
turns_per_traj     = []
cat_counts         = Counter()


for ex in sub:
   cat_counts[ex["category"]] += 1
   n_calls = n_err = 0
   turns_per_traj.append(len(ex["conversations"]))
   for t in ex["conversations"]:
       if t["from"] == "gpt":
           p = parse_assistant(t["value"])
           thoughts_per_turn.append(len(p["thoughts"]))
           if p["tool_calls"]:
               parallel_widths[len(p["tool_calls"])] += 1
               for c in p["tool_calls"]:
                   tool_calls[c.get("name", "")] += 1
               n_calls += len(p["tool_calls"])
       elif t["from"] == "tool":
           r = parse_tool(t["value"])
           blob = json.dumps(r).lower()
           if "error" in blob or '"exit_code": 1' in blob or "traceback" in blob:
               n_err += 1
   calls_per_traj.append(n_calls)
   errors_per_traj.append(n_err)


print(f"\nScanned {len(sub)} trajectories")
print(f"Avg turns/traj      : {np.mean(turns_per_traj):.1f}")
print(f"Avg tool calls/traj : {np.mean(calls_per_traj):.1f}")
print(f"% with >=1 error    : {100*np.mean([e>0 for e in errors_per_traj]):.1f}%")
print(f"% parallel turns    : {100*sum(v for k,v in parallel_widths.items() if k>1)/max(1,sum(parallel_widths.values())):.1f}%")
print("Top 10 tools        :", tool_calls.most_common(10))


fig, axes = plt.subplots(2, 2, figsize=(13, 9))


top = tool_calls.most_common(15)
axes[0,0].barh([t for t,_ in top][::-1], [c for _,c in top][::-1], color="teal")
axes[0,0].set_title("Top 15 tools by call volume")
axes[0,0].set_xlabel("calls")


ks = sorted(parallel_widths)
axes[0,1].bar([str(k) for k in ks], [parallel_widths[k] for k in ks], color="coral")
axes[0,1].set_title("Tool-calls per assistant turn (parallel width)")
axes[0,1].set_xlabel("# tool calls in one turn"); axes[0,1].set_ylabel("count")
axes[0,1].set_yscale("log")


axes[1,0].hist(turns_per_traj, bins=40, color="steelblue")
axes[1,0].set_title("Conversation length"); axes[1,0].set_xlabel("turns")


cats, vals = zip(*cat_counts.most_common())
axes[1,1].pie(vals, labels=cats, autopct="%1.0f%%", startangle=90)
axes[1,1].set_title("Category distribution")


plt.tight_layout(); plt.show()

We perform a broad analysis of the dataset to measure tool usage, conversation length, and error patterns. We aggregate statistics across many samples to understand the agent’s overall behavior. We also create visualizations that highlight trends such as tool-call frequency, parallel calls, and category distribution.
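The error heuristic used in the scan is purely string-based; a standalone sketch (with invented tool responses) shows what it does and does not flag:

```python
import json

def looks_like_error(resp: dict) -> bool:
    # Flag a tool response if its serialized form mentions an error,
    # a non-zero exit code, or a Python traceback (same heuristic as the scan)
    blob = json.dumps(resp).lower()
    return "error" in blob or '"exit_code": 1' in blob or "traceback" in blob

# Invented tool responses for illustration
responses = [
    {"stdout": "ok", "exit_code": 0},
    {"stderr": "FileNotFoundError", "exit_code": 1},
    {"result": "Traceback (most recent call last): ..."},
]
flags = [looks_like_error(r) for r in responses]
print(flags)  # [False, True, True]
```

Being a substring match, this heuristic can over-count (e.g. a tool legitimately returning the word "error" in prose), so the error rate should be read as an upper bound.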

def render_trace(ex, max_chars=350):
   print(f"\n{'='*72}\nTASK [{ex['category']} / {ex['subcategory']}]: {ex['task']}\n{'='*72}")
   for t in ex["conversations"]:
       role = t["from"]
       if role == "system":
           continue
       if role == "human":
           print(f"\n[USER]\n{textwrap.shorten(t['value'], 600)}")
       elif role == "gpt":
           p = parse_assistant(t["value"])
           for th in p["thoughts"]:
               print(f"\n[THINK]\n{textwrap.shorten(th, max_chars)}")
           for c in p["tool_calls"]:
               args = json.dumps(c.get("arguments", {}))[:200]
               print(f"[CALL] {c.get('name')}({args})")
           if p["final"]:
               print(f"\n[ANSWER]\n{textwrap.shorten(p['final'], max_chars)}")
       elif role == "tool":
           print(f"[TOOL_RESPONSE] {textwrap.shorten(t['value'], 220)}")
   print("="*72)


idx = int(np.argmin(np.abs(np.array(turns_per_traj) - 10)))
render_trace(sub[idx])


def get_tool_schemas(ex):
   try:    return json.loads(ex["tools"])
   except (KeyError, TypeError, json.JSONDecodeError): return []


schemas = get_tool_schemas(sample)
print(f"\nSample 0 has {len(schemas)} tools available")
for s in schemas[:3]:
   fn = s.get("function", {})
   print(" -", fn.get("name"), "—", (fn.get("description") or "")[:80])


ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant", "tool": "tool"}


def to_openai_messages(conv):
   return [{"role": ROLE_MAP[t["from"]], "content": t["value"]} for t in conv]


example_msgs = to_openai_messages(sample["conversations"])
print("\nFirst 2 OpenAI messages:")
for m in example_msgs[:2]:
   print(" ", m["role"], "→", m["content"][:120].replace("\n", " "), "...")

We build utilities to render a full conversation trace in a readable format for in-depth inspection. We also extract tool schemas and convert the dataset into an OpenAI-style message format so it is compatible with training pipelines. This helps us better understand both the structure of the tools and how the conversations can be formatted.
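The ShareGPT-to-OpenAI conversion is just a role rename; a minimal self-contained sketch (with an invented conversation) makes the mapping explicit:

```python
# ShareGPT-style turns use {"from", "value"}; OpenAI-style messages
# use {"role", "content"}. The conversation below is invented.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant", "tool": "tool"}

conv = [
    {"from": "system", "value": "You are a helpful agent."},
    {"from": "human", "value": "List the files in /tmp."},
    {"from": "gpt", "value": '<tool_call>{"name": "ls", "arguments": {"path": "/tmp"}}</tool_call>'},
    {"from": "tool", "value": '{"stdout": "a.txt  b.txt"}'},
]

messages = [{"role": ROLE_MAP[t["from"]], "content": t["value"]} for t in conv]
print([m["role"] for m in messages])  # ['system', 'user', 'assistant', 'tool']
```

The message content is passed through untouched, so the Hermes-style tags survive the conversion and remain available for downstream templating.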

from transformers import AutoTokenizer
TOK_ID = "Qwen/Qwen2.5-0.5B-Instruct"
tok = AutoTokenizer.from_pretrained(TOK_ID)


def build_masked(conv, tokenizer, max_len=2048):
   msgs = to_openai_messages(conv)
   for m in msgs:
       if m["role"] == "tool":
           m["role"] = "user"
           m["content"] = "[TOOL OUTPUT]\n" + m["content"]
   input_ids, labels = [], []
   for m in msgs:
       text = tokenizer.apply_chat_template([m], tokenize=False, add_generation_prompt=False)
       ids = tokenizer.encode(text, add_special_tokens=False)
       input_ids.extend(ids)
       labels.extend(ids if m["role"] == "assistant" else [-100] * len(ids))
   return input_ids[:max_len], labels[:max_len]


ids, lbls = build_masked(sample["conversations"], tok)
trainable = sum(1 for x in lbls if x != -100)
print(f"\nTokenized example: {len(ids)} tokens, {trainable} trainable ({100*trainable/len(ids):.1f}%)")


think_lens, call_lens, ans_lens = [], [], []
for ex in sub.select(range(min(500, len(sub)))):
   for t in ex["conversations"]:
       if t["from"] != "gpt": continue
       p = parse_assistant(t["value"])
       for th in p["thoughts"]: think_lens.append(len(th))
       for c in p["tool_calls"]: call_lens.append(len(json.dumps(c)))
       if p["final"]: ans_lens.append(len(p["final"]))


plt.figure(figsize=(10,4))
plt.hist([think_lens, call_lens, ans_lens], bins=40, log=True,
         label=["<think>", "<tool_call>", "final answer"], stacked=False)
plt.legend(); plt.xlabel("characters"); plt.title("Length distributions (log y)")
plt.tight_layout(); plt.show()


class TraceReplayer:
   def __init__(self, ex):
       self.ex = ex
       self.steps = []
       pending = None
       for t in ex["conversations"]:
           if t["from"] == "gpt":
               if pending: self.steps.append(pending)
               pending = {"think": parse_assistant(t["value"]), "responses": []}
           elif t["from"] == "tool" and pending:
               pending["responses"].append(parse_tool(t["value"]))
       if pending: self.steps.append(pending)
   def __len__(self): return len(self.steps)
   def play(self, i):
       s = self.steps[i]
       print(f"\n── Step {i+1}/{len(self)} ──")
       for th in s["think"]["thoughts"]:
           print(f"💭 {textwrap.shorten(th, 280)}")
       for c in s["think"]["tool_calls"]:
           print(f"⚙️  {c.get('name')}({json.dumps(c.get('arguments', {}))[:140]})")
       for r in s["responses"]:
           print(f"📥 {textwrap.shorten(json.dumps(r), 200)}")
       if s["think"]["final"]:
           print(f"💬 {textwrap.shorten(s['think']['final'], 200)}")


rp = TraceReplayer(sample)
for i in range(min(3, len(rp))):
   rp.play(i)


TRAIN = False
if TRAIN:
   import torch
   from transformers import AutoModelForCausalLM
   from trl import SFTTrainer, SFTConfig


   train_subset = ds.select(range(200))


   def to_text(batch):
       msgs = to_openai_messages(batch["conversations"])
       for m in msgs:
           if m["role"] == "tool":
               m["role"] = "user"; m["content"] = "[TOOL]\n" + m["content"]
       batch["text"] = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=False)
       return batch


   train_subset = train_subset.map(to_text)


   model = AutoModelForCausalLM.from_pretrained(
       TOK_ID,
       torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
       device_map="auto" if torch.cuda.is_available() else None,
   )


   cfg = SFTConfig(
       output_dir="hermes-sft-demo",
       per_device_train_batch_size=1,
       gradient_accumulation_steps=4,
       max_steps=20,
       learning_rate=2e-5,
       logging_steps=2,
       max_seq_length=1024,
       dataset_text_field="text",
       report_to="none",
       fp16=torch.cuda.is_available(),
   )
   SFTTrainer(model=model, args=cfg, train_dataset=train_subset, processing_class=tok).train()
   print("Fine-tune demo finished.")


print("\n✅ Tutorial complete. You now have parsers, analytics, plots, a replayer, "
      "tokenized + label-masked SFT examples, and an optional training hook.")

We tokenize the conversations and apply label masking so that only the assistant’s responses contribute to training. We analyze the length distributions of thoughts, tool calls, and final answers to gain further insight. We also build a trace replayer to step through the agent’s behavior and include an optional small fine-tuning loop.
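The label-masking convention is independent of any particular tokenizer; this toy sketch (using a stand-in whitespace "tokenizer", purely for illustration) shows how the -100 labels exclude non-assistant tokens from the loss:

```python
# Toy illustration of label masking: assistant tokens keep their ids,
# all other tokens get -100 so the loss ignores them. The whitespace
# "tokenizer" below is a stand-in, not a real tokenizer.
def toy_encode(text):
    return [hash(w) % 1000 for w in text.split()]

msgs = [
    {"role": "user", "content": "what is two plus two"},
    {"role": "assistant", "content": "two plus two is four"},
]

input_ids, labels = [], []
for m in msgs:
    ids = toy_encode(m["content"])
    input_ids.extend(ids)
    labels.extend(ids if m["role"] == "assistant" else [-100] * len(ids))

trainable = sum(1 for x in labels if x != -100)
print(len(input_ids), trainable)  # 10 5
```

The value -100 is the default `ignore_index` of PyTorch's cross-entropy loss, which is why Hugging Face training code uses it to mark positions that should not be trained on.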

In conclusion, we have developed a structured workflow to parse, analyze, visualize, and efficiently process agent reasoning traces. We broke conversations down into meaningful parts, examined how agents think step by step, and measured how they interact with tools during problem solving. Through visualization and analysis, we surfaced insights into tool-usage patterns and behaviors across the dataset. In addition, we converted the data into a format suitable for training language models, including tokenization and label masking so that only the assistant’s responses contribute to the loss. This process provides a solid foundation for studying, evaluating, and developing AI systems that use tools in a realistic, scalable way.

