A Coding Implementation on Microsoft’s OpenMementos with Trace Structure Analysis, Context Compression, and Fine-Tuning Data Preparation
In this tutorial, we work with Microsoft’s OpenMementos dataset and explore how reasoning traces are structured into blocks and mementos in a practical, Colab-ready workflow. We stream the dataset efficiently, parse its special-token format, examine how reasoning and summaries are organized, and measure the compression provided by the memento representation across different domains. As we move through the analysis, we also visualize dataset patterns, align the streamed format with the richer full subset, simulate inference-time compression, and prepare the data for supervised fine-tuning. In this way, we build both an intuitive and a technical understanding of how OpenMementos captures long-form reasoning while preserving compact summaries that can support efficient training and inference.
!pip install -q -U datasets transformers matplotlib pandas
import re, itertools, textwrap
from collections import Counter
from typing import Dict
import pandas as pd
import matplotlib.pyplot as plt
from datasets import load_dataset
DATASET = "microsoft/OpenMementos"
ds_stream = load_dataset(DATASET, split="train", streaming=True)
first_row = next(iter(ds_stream))
print("Columns :", list(first_row.keys()))
print("Domain :", first_row["domain"], "| Source:", first_row["source"])
print("Problem head:", first_row["problem"][:160].replace("\n", " "), "...")
We install the required libraries and import the core tools needed for dataset streaming, parsing, analysis, and visualization. We then connect to the Microsoft OpenMementos dataset in streaming mode to inspect it without downloading the entire dataset locally. By reading the first example, we begin to understand the dataset schema, the problem format, and the domain and source metadata attached to each reasoning trace.
BLOCK_RE = re.compile(r"<\|block_start\|>(.*?)<\|block_end\|>", re.DOTALL)
SUMMARY_RE = re.compile(r"<\|summary_start\|>(.*?)<\|summary_end\|>", re.DOTALL)
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
def parse_memento(response: str) -> Dict:
    blocks = [m.strip() for m in BLOCK_RE.findall(response)]
    summaries = [m.strip() for m in SUMMARY_RE.findall(response)]
    think_m = THINK_RE.search(response)
    final_ans = response.split("</think>")[-1].strip() if "</think>" in response else ""
    return {"blocks": blocks, "summaries": summaries,
            "reasoning": (think_m.group(1) if think_m else ""),
            "final_answer": final_ans}

parsed = parse_memento(first_row["response"])
print(f"\n→ {len(parsed['blocks'])} blocks, {len(parsed['summaries'])} mementos parsed")
print("First block :", parsed["blocks"][0][:140].replace("\n", " "), "...")
print("First memento :", parsed["summaries"][0][:140].replace("\n", " "), "...")
N_SAMPLES = 500
rows = []
for i, ex in enumerate(itertools.islice(
        load_dataset(DATASET, split="train", streaming=True), N_SAMPLES)):
    p = parse_memento(ex["response"])
    if not p["blocks"] or len(p["blocks"]) != len(p["summaries"]):
        continue
    blk_c = sum(len(b) for b in p["blocks"])
    sum_c = sum(len(s) for s in p["summaries"])
    blk_w = sum(len(b.split()) for b in p["blocks"])
    sum_w = sum(len(s.split()) for s in p["summaries"])
    rows.append(dict(domain=ex["domain"], source=ex["source"],
                     n_blocks=len(p["blocks"]),
                     block_chars=blk_c, summ_chars=sum_c,
                     block_words=blk_w, summ_words=sum_w,
                     compress_char=sum_c / max(blk_c, 1),
                     compress_word=sum_w / max(blk_w, 1)))
    if (i + 1) % 100 == 0:
        print(f" processed {i+1}/{N_SAMPLES}")

df = pd.DataFrame(rows)
print(f"\nAnalyzed {len(df)} rows. Domain counts:")
print(df["domain"].value_counts().to_string())

per_dom = df.groupby("domain").agg(
    n=("domain", "count"),
    median_blocks=("n_blocks", "median"),
    median_block_words=("block_words", "median"),
    median_summ_words=("summ_words", "median"),
    median_char_ratio=("compress_char", "median"),
    median_word_ratio=("compress_word", "median"),
).round(3)
print("\nPer-domain medians (ratio = mementos / blocks):")
print(per_dom.to_string())
We define the regex-based parser that extracts reasoning blocks, memento summaries, the main thinking section, and the final answer from each response. We test the parser on the first streamed example and verify that the block-summary structure is being captured correctly. We then run a streaming analysis over several hundred samples to compute block counts, word counts, character counts, and compression ratios, which helps us study how the dataset behaves across examples and domains.
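To make the special-token layout concrete, here is a minimal, self-contained sketch that applies the same regex parsing to a hand-written synthetic response. The trace text below is invented for illustration and is not taken from the dataset:

```python
import re

# Same patterns as in the tutorial, with the pipe characters escaped.
BLOCK_RE = re.compile(r"<\|block_start\|>(.*?)<\|block_end\|>", re.DOTALL)
SUMMARY_RE = re.compile(r"<\|summary_start\|>(.*?)<\|summary_end\|>", re.DOTALL)

# A tiny hand-written trace in the OpenMementos special-token layout.
synthetic = (
    "<think>\n"
    "<|block_start|>Let x be the unknown. Expanding (x+1)^2 gives "
    "x^2 + 2x + 1, so we need x^2 + 2x + 1 = 9.<|block_end|>\n"
    "<|summary_start|>Set up (x+1)^2 = 9.<|summary_end|>\n"
    "<|block_start|>Taking square roots, x+1 = ±3, hence x = 2 or "
    "x = -4.<|block_end|>\n"
    "<|summary_start|>Solved: x = 2 or x = -4.<|summary_end|>\n"
    "</think>\n"
    "The solutions are x = 2 and x = -4."
)

blocks = [m.strip() for m in BLOCK_RE.findall(synthetic)]
summaries = [m.strip() for m in SUMMARY_RE.findall(synthetic)]
final_answer = synthetic.split("</think>")[-1].strip()

print(len(blocks), "blocks,", len(summaries), "mementos")
print("Final answer:", final_answer)
```

Each block pairs with exactly one memento, and everything after `</think>` is the final answer, which is why the parser can recover the structure with three regexes and one split.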
def compress_trace(response: str, keep_last_k: int = 1) -> str:
    blocks, summaries = BLOCK_RE.findall(response), SUMMARY_RE.findall(response)
    if not blocks or len(blocks) != len(summaries):
        return response
    out, n = ["<think>"], len(blocks)
    for i, (b, s) in enumerate(zip(blocks, summaries)):
        if i >= n - keep_last_k:
            out.append(f"<|block_start|>{b}<|block_end|>")
            out.append(f"<|summary_start|>{s}<|summary_end|>")
        else:
            out.append(f"<|summary_start|>{s}<|summary_end|>")
    out.append("</think>")
    out.append(response.split("</think>")[-1])
    return "\n".join(out)

orig, comp = first_row["response"], compress_trace(first_row["response"], 1)
print(f"\nOriginal : {len(orig):>8,} chars")
print(f"Compressed : {len(comp):>8,} chars ({len(comp)/len(orig)*100:.1f}% of original)")
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
MEM_TOKENS = ["<|block_start|>", "<|block_end|>",
              "<|summary_start|>", "<|summary_end|>",
              "<think>", "</think>"]
tok.add_special_tokens({"additional_special_tokens": MEM_TOKENS})

def tlen(s): return len(tok(s, add_special_tokens=False).input_ids)

blk_tok = sum(tlen(b) for b in parsed["blocks"])
sum_tok = sum(tlen(s) for s in parsed["summaries"])
print(f"\nTrace-level token compression for this example:")
print(f" block tokens = {blk_tok}")
print(f" memento tokens = {sum_tok}")
print(f" compression = {blk_tok / max(sum_tok,1):.2f}× (paper reports ~6×)")
def to_chat(ex):
    return {"messages": [
        {"role": "user", "content": ex["problem"]},
        {"role": "assistant", "content": ex["response"]},
    ]}

chat_stream = load_dataset(DATASET, split="train", streaming=True).map(to_chat)
chat_ex = next(iter(chat_stream))
print("\nSFT chat example (truncated):")
for m in chat_ex["messages"]:
    print(f" [{m['role']:9s}] {m['content'][:130].replace(chr(10),' ')}...")
We visualize the dataset’s structural patterns by plotting block counts, compression ratios, and the relationship between block size and memento size. We compare these distributions across domains to see how reasoning organization differs between math, code, and science examples. We also stream one example from the full subset and inspect its additional sentence-level and block-alignment fields, which helps us understand the richer internal annotation pipeline behind the dataset.
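A minimal sketch of this plotting step, using the column names from the per-example statistics frame built in the analysis loop above. To keep the snippet self-contained, a small synthetic stand-in replaces the real `df`; the figure layout and file name are illustrative choices, not part of the original workflow:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Synthetic stand-in for the per-example stats frame built in the analysis loop.
df = pd.DataFrame({
    "domain": ["math", "math", "code", "code", "science"],
    "n_blocks": [6, 9, 4, 5, 7],
    "block_words": [1200, 2100, 800, 950, 1500],
    "summ_words": [180, 310, 140, 160, 240],
    "compress_word": [0.150, 0.148, 0.175, 0.168, 0.160],
})

fig, axes = plt.subplots(1, 3, figsize=(13, 3.5))
# Distribution of blocks per trace.
df["n_blocks"].plot.hist(ax=axes[0], bins=5, title="Blocks per trace")
# Word-level compression ratio, grouped by domain.
df.boxplot(column="compress_word", by="domain", ax=axes[1])
axes[1].set_title("Word-level compression by domain")
# Block size versus memento size.
axes[2].scatter(df["block_words"], df["summ_words"])
axes[2].set_xlabel("block words")
axes[2].set_ylabel("memento words")
axes[2].set_title("Block size vs memento size")
fig.suptitle("")  # drop the automatic "Boxplot grouped by ..." suptitle
fig.tight_layout()
fig.savefig("openmementos_patterns.png", dpi=120)
print("saved openmementos_patterns.png")
```

Swapping the synthetic frame for the real `df` reproduces the three views described above: block-count histogram, per-domain compression boxplot, and the block-size/memento-size scatter.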
We simulate inference-time compression by rewriting a reasoning trace so that older blocks are replaced by their mementos while the most recent blocks remain intact. We then compare the original and compressed trace lengths to see how much context can be reduced in practice. After that, we load a tokenizer, add the special memento tokens, measure token-level compression, and convert the dataset to an SFT-style chat format suitable for training workflows.
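A common next step after producing the chat format is to serialize the examples as JSONL for an SFT trainer. The sketch below does this for two synthetic records in the same `{"messages": [...]}` shape that `to_chat` produces; the file name and record contents are invented for illustration:

```python
import json

# Synthetic chat records in the same shape as to_chat() produces.
records = [
    {"messages": [
        {"role": "user", "content": "Solve (x+1)^2 = 9."},
        {"role": "assistant", "content": "<think>...</think>\nx = 2 or x = -4."},
    ]},
    {"messages": [
        {"role": "user", "content": "Reverse a linked list."},
        {"role": "assistant", "content": "<think>...</think>\nIterate, flipping next pointers."},
    ]},
]

# One JSON object per line: the usual SFT interchange format.
with open("openmementos_sft.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")

# Round-trip check: each line parses back to one independent record.
with open("openmementos_sft.jsonl", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
print(len(loaded), "records written")
```

With the real stream, the same loop over `chat_stream` (bounded by `itertools.islice`) yields a training file most SFT frameworks can consume directly.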
def render_trace(response: str, width: int = 220) -> None:
    p = parse_memento(response)
    print("=" * 72)
    print(f"{len(p['blocks'])} blocks · {len(p['summaries'])} mementos")
    print("=" * 72)
    for i, (b, s) in enumerate(zip(p["blocks"], p["summaries"]), 1):
        ratio = len(s) / max(len(b), 1) * 100
        print(f"\nBLOCK {i} ({len(b):,} chars)")
        print(textwrap.indent(textwrap.shorten(b.replace("\n", " "), width=width), "  "))
        print(f"MEMENTO {i} ({len(s):,} chars · {ratio:.1f}% of block)")
        print(textwrap.indent(textwrap.shorten(s.replace("\n", " "), width=width), "  "))
    if p["final_answer"]:
        print("\n★ FINAL ANSWER")
        print(textwrap.indent(textwrap.shorten(p["final_answer"].replace("\n", " "),
                                               width=width * 2), "  "))

render_trace(first_row["response"])
We build a pretty-printer that renders a single reasoning trace in a much more readable block-by-block format. We display each block alongside its paired memento and compute the summary’s size relative to the original block, making the compression effect easy to inspect manually. By running this renderer on the first example, we get a clean qualitative view of how OpenMementos organizes reasoning and preserves essential information through summaries.
In conclusion, we gained a clear view of how OpenMementos represents reasoning as a sequence of detailed blocks paired with concise mementos, and we saw why this structure is useful for context compression. We parsed real examples, computed domain-level statistics, compared block and summary lengths, and observed how compressed traces can reduce token usage while still retaining key information. We also aligned the streamed dataset format with the full subset, converted the data to an SFT-ready chat structure, and built tools to inspect traces more clearly. Through this end-to-end workflow, we understand the dataset itself and see how it can serve as a practical foundation for studying reasoning traces, memory-style summarization, and efficient long-context model behavior.
The post A Coding Implementation on Microsoft’s OpenMementos with Trace Structure Analysis, Context Compression, and Fine-Tuning Data Preparation appeared first on MarkTechPost.
