What is Tokenization Drift and How to Fix It?

A model can behave perfectly one moment and degrade the next, without any change to your data, pipeline, or logic. The root cause often lies in something far more subtle: how your input is tokenized. Before a model processes text, it converts it into token IDs, and even minor formatting differences (spacing, line breaks, or punctuation) can produce entirely different token sequences. This phenomenon is commonly called tokenization drift: small surface-level changes push your input into a different region of token space, leading to unpredictable shifts in model behavior.

The impact goes deeper than token IDs alone. During instruction tuning, models learn not only tasks but also the structure in which those tasks are presented: specific separators, prefixes, and formatting patterns. When your prompt deviates from these learned patterns, you are no longer operating within the model's familiar distribution. The result isn't confusion; it's a model doing its best on inputs it was never optimized to handle.

In this article, we'll break this down using the GPT-2 tokenizer to show how small formatting changes affect tokens, and build a simple metric to measure drift across prompts.

Then, we'll implement a lightweight prompt optimization loop to identify formats that keep your inputs consistent and reliable.

Setting up the dependencies

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns
from collections import defaultdict
from sklearn.decomposition import PCA

In this code block, we load the GPT-2 tokenizer, which uses the same Byte-Pair Encoding (BPE) approach found in GPT-4, LLaMA, and Mistral. We use GPT-2 specifically because it requires no auth token and exhibits the space-prefix artifact in the same way as modern production tokenizers.

from transformers import AutoTokenizer
 
tokenizer = AutoTokenizer.from_pretrained("gpt2")
 
print("Tokenizer loaded:", tokenizer.__class__.__name__)
print("Vocab dimension:", tokenizer.vocab_size)

Tokenization Artifact Demo

We take seven words and test each in two forms, once with a leading space and once without, then encode them using the GPT-2 tokenizer. By setting add_special_tokens=False, we ensure we're only measuring the token IDs for the words themselves, without any extra padding or special markers.

The results are striking. Not a single pair produces the same token ID; every word is treated as completely different depending on whether it has a leading space. Even more interesting, some words without the space don't map to a single token at all. For example, "classify" becomes two tokens [4871, 1958], while " classify" is a single token [36509]. This means the model doesn't just see a different ID; it sees a different sequence length, which shifts how attention is computed for everything that follows.

pairs = [
    (" classify",  "classify"),
    (" answer",    "answer"),
    (" positive",  "positive"),
    (" negative",  "negative"),
    (" sentiment", "sentiment"),
    (" output",    "output"),
    (" label",     "label"),
]
 
print("=" * 60)
print(f"{'Token (with area)':<22} {'ID':>6}   {'Token (no area)':<20} {'ID':>6}  {'Same?':>6}")
print("=" * 60)
 
for with_space, without_space in pairs:
    id_ws  = tokenizer.encode(with_space,  add_special_tokens=False)
    id_nws = tokenizer.encode(without_space, add_special_tokens=False)
    match  = "✓" if id_ws == id_nws else "✗ DIFFERENT"
    print(f"{repr(with_space):<22} {str(id_ws):>8}   {repr(without_space):<20} {str(id_nws):>8}  {match}")
 
print()
print("Key takeaway: Leading areas create DIFFERENT token IDs.")
print("To the mannequin, ' classify' and 'classify' are as distinct as 'apple' and 'orange'.")

Visualising the Token ID Shift

We plot two charts to make the token ID gap visible. The left chart shows the raw IDs side by side, blue for space-prefixed and red for bare, and the right chart plots the absolute distance between each pair.

words   = [p[1] for p in pairs]
ids_ws  = [tokenizer.encode(" " + w, add_special_tokens=False)[0] for w in words]
ids_nws = [tokenizer.encode(w, add_special_tokens=False)[0] for w in words]
delta   = [abs(a - b) for a, b in zip(ids_ws, ids_nws)]
 
x = np.arange(len(words))
width = 0.35
 
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
fig.patch.set_facecolor("#FAFAF8")
 
# Left: side-by-side token IDs
ax = axes[0]
ax.set_facecolor("#FAFAF8")
bars1 = ax.bar(x - width/2, ids_ws,  width, label='With leading space',    color="#3B6FE0", alpha=0.85)
bars2 = ax.bar(x + width/2, ids_nws, width, label='Without leading space', color="#E05C3B", alpha=0.85)
ax.set_xticks(x)
ax.set_xticklabels(words, rotation=30, ha="right", fontsize=9)
ax.set_ylabel("Token ID", fontsize=10)
ax.set_title("Token IDs: ' word'  vs  'word'", fontsize=12, fontweight="bold", pad=12)
ax.legend(fontsize=9)
ax.spines[["top", "right"]].set_visible(False)
ax.grid(axis="y", alpha=0.3)
 
for bar in bars1:
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 50,
            str(int(bar.get_height())), ha="center", va="bottom", fontsize=7, color="#3B6FE0")
for bar in bars2:
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 50,
            str(int(bar.get_height())), ha="center", va="bottom", fontsize=7, color="#E05C3B")
 
# Right: delta
ax2 = axes[1]
ax2.set_facecolor("#FAFAF8")
color_bars = ["#E05C3B" if d > 500 else "#F0A070" if d > 100 else "#A8C4F0" for d in delta]
bars3 = ax2.bar(words, delta, color=color_bars, alpha=0.9)
ax2.set_ylabel("Absolute Token ID Distance", fontsize=10)
ax2.set_title("How Far Apart Are the Token IDs?", fontsize=12, fontweight="bold", pad=12)
ax2.set_xticklabels(words, rotation=30, ha="right", fontsize=9)
ax2.spines[["top", "right"]].set_visible(False)
ax2.grid(axis="y", alpha=0.3)
 
for bar, d in zip(bars3, delta):
    ax2.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 10,
             str(d), ha="center", va="bottom", fontsize=9, fontweight="bold")
 
high = mpatches.Patch(color="#E05C3B", alpha=0.9, label="> 500 apart")
med  = mpatches.Patch(color="#F0A070", alpha=0.9, label="100-500 apart")
low  = mpatches.Patch(color="#A8C4F0", alpha=0.9, label="< 100 apart")
ax2.legend(handles=[high, med, low], fontsize=8)
 
plt.tight_layout(pad=2)
plt.suptitle("Tokenization Artifacts: One Space, Completely Different Token",
             fontsize=14, fontweight="bold", y=1.02)
plt.savefig("tokenization_artifact.png", dpi=150, bbox_inches="tight", facecolor="#FAFAF8")
plt.show()

Simulating Accuracy Drop Demo

We start with a standard SFT prompt format and create a few variations by making small changes, like removing newlines, tweaking punctuation, or rewording the instruction. We then measure how similar each version is using token overlap. The key insight is that even small formatting changes can significantly alter the token sequence. For example, removing newlines drops similarity to around 80%, showing that these aren't just cosmetic; they're signals the model relies on. The biggest impact comes from rewording the instruction, which cuts overlap to nearly 50%, meaning the prompt no longer resembles what the model was trained on and raising the risk of unpredictable behavior.

def tokenize_prompt(text):
    return tokenizer.encode(text, add_special_tokens=False)
 
# The canonical SFT (fine-tuning) template -- what the model was trained on
sft_template = (
    "Below is a customer review. Classify the sentiment.\n\n"
    "Review: {review}\n\n"
    "Sentiment:"
)
 
# Prompt variants -- small changes, big token consequences
variants = {
    "✓ SFT template (optimal)":        "Below is a customer review. Classify the sentiment.\n\nReview: {review}\n\nSentiment:",
    "✗ Removed newlines":              "Below is a customer review. Classify the sentiment. Review: {review} Sentiment:",
    "✗ Removed leading space on word": "Below is a customer review. Classify the sentiment.\n\nReview:{review}\n\nSentiment:",
    "✗ Colon → dash":                  "Below is a customer review. Classify the sentiment.\n\nReview - {review}\n\nSentiment -",
    "✗ Reworded instruction":          "Determine the sentiment of the following review.\n\nReview: {review}\n\nAnswer:",
}
 
sample_review = "The product exceeded all my expectations. Highly recommend!"
 
# Token overlap with the SFT template (proxy for in-distribution similarity)
sft_tokens = set(tokenize_prompt(sft_template.format(review=sample_review)))
 
print("=" * 65)
print(f"{'Prompt Variant':<42} {'Shared Tokens':>14} {'OOD Risk':>9}")
print("=" * 65)
 
overlap_scores = {}
for name, template in variants.items():
    prompt   = template.format(review=sample_review)
    tokens   = set(tokenize_prompt(prompt))
    overlap  = len(sft_tokens & tokens) / len(sft_tokens | tokens)  # Jaccard
    ood_risk = "LOW" if overlap > 0.80 else "MEDIUM" if overlap > 0.60 else "HIGH"
    overlap_scores[name] = overlap
    print(f"  {name:<42} {overlap:>13.1%} {ood_risk:>9}")
 
print()
print("Jaccard similarity measures token-level overlap with the SFT template.")
print("Lower overlap → higher out-of-distribution risk → accuracy drops.")

Visualising the OOD Risk

fig, ax = plt.subplots(figsize=(11, 5))
fig.patch.set_facecolor("#FAFAF8")
ax.set_facecolor("#FAFAF8")
 
labels = list(overlap_scores.keys())
scores = list(overlap_scores.values())
colors = ["#3B6FE0" if s > 0.80 else "#F0A070" if s > 0.60 else "#E05C3B" for s in scores]
 
bars = ax.barh(labels, scores, color=colors, alpha=0.88, height=0.55)
ax.axvline(x=0.80, color="#3B6FE0", linestyle="--", linewidth=1.4, alpha=0.6, label="Safe threshold (0.80)")
ax.axvline(x=0.60, color="#E05C3B", linestyle="--", linewidth=1.4, alpha=0.6, label="Danger threshold (0.60)")
ax.set_xlabel("Token Overlap with SFT Template (Jaccard)", fontsize=10)
ax.set_title("Out-of-Distribution Risk per Prompt Variant", fontsize=13, fontweight="bold", pad=12)
ax.set_xlim(0, 1.05)
ax.spines[["top", "right"]].set_visible(False)
ax.grid(axis="x", alpha=0.25)
 
for bar, score in zip(bars, scores):
    ax.text(bar.get_width() + 0.01, bar.get_y() + bar.get_height()/2,
            f"{score:.0%}", va="center", fontsize=9, fontweight="bold")
 
ax.legend(fontsize=9)
plt.tight_layout(pad=2)
plt.savefig("ood_risk.png", dpi=150, bbox_inches="tight", facecolor="#FAFAF8")
plt.show()

Automated Prompt Optimisation (APO)

This is where Automated Prompt Optimization (APO) becomes practical. We take a small validation set and test several prompt templates that differ in how closely they match the original SFT format. Instead of guessing which one works best, we simulate model performance by measuring token overlap with the SFT template and applying a penalty for going out-of-distribution. Templates that deviate more, such as those removing structure or heavily rewording the instruction, get lower effective accuracy, while those closer to the original format perform better.

The results are clear: most variants perform poorly (around 40–50% accuracy), while the one that closely matches the SFT template stands out with ~83% effectiveness. The APO loop simply picks the best-performing template automatically. In real-world systems, the same idea is applied at scale with actual model outputs: test several prompt formats, score them, and lock in the one that keeps performance stable.

np.random.seed(42)
 
VALIDATION_SET = [
    {"review": "Absolutely terrible. Would not buy again.",          "label": "negative"},
    {"review": "Best purchase I have made this year!",               "label": "positive"},
    {"review": "Arrived broken. Customer service was unhelpful.",    "label": "negative"},
    {"review": "Good quality, fast delivery, very happy.",           "label": "positive"},
    {"review": "It's okay, nothing special.",                        "label": "neutral"},
    {"review": "Exceeded expectations. Premium feel.",               "label": "positive"},
    {"review": "Complete waste of money.",                           "label": "negative"},
    {"review": "Works as described, decent value.",                  "label": "neutral"},
]
 
CANDIDATE_PROMPTS = {
    "Variant A -- No formatting":
        "Classify: {review} Answer:",
 
    "Variant B -- Minimal newline":
        "Review: {review}\nSentiment:",
 
    "Variant C -- SFT-aligned (newlines + colon)":
        "Below is a customer review. Classify the sentiment.\n\nReview: {review}\n\nSentiment:",
 
    "Variant D -- XML tags":
        "<review>{review}</review>\n<sentiment>",
 
    "Variant E -- Full instruction block":
        "You are a sentiment classifier.\n\nInput: {review}\n\nOutput (positive/negative/neutral):",
}
 
def simulate_model_output(prompt_template, review, label, ood_penalty):
    """
    Simulate model accuracy as a function of:
      - base accuracy (0.85)
      - out-of-distribution penalty derived from token overlap
      - small random noise
    """
    tokens_template = set(tokenize_prompt(prompt_template.format(review="")))
    tokens_sft      = set(tokenize_prompt(sft_template.format(review="")))
    overlap         = len(tokens_template & tokens_sft) / max(len(tokens_template | tokens_sft), 1)
    
    base_acc = 0.85
    # OOD penalty: low overlap → low accuracy
    effective_acc = base_acc * (0.5 + 0.5 * overlap) - ood_penalty
    effective_acc = np.clip(effective_acc, 0.40, 0.95)
    
    # Simulate per-sample prediction
    correct = np.random.rand() < effective_acc
    return correct, effective_acc
 
# APO outer loop: evaluate each candidate on the validation set
print("=" * 65)
print("AUTOMATED PROMPT OPTIMISATION -- Validation Run")
print("=" * 65)
 
apo_results = {}
ood_penalties = {"Variant A": 0.18, "Variant B": 0.10, "Variant C": 0.02,
                 "Variant D": 0.08, "Variant E": 0.04}
 
for name, template in CANDIDATE_PROMPTS.items():
    key = name.split("--")[0].strip()
    penalty = ood_penalties.get(key, 0.05)
    
    correct_count = 0
    per_sample_acc = []
    for sample in VALIDATION_SET:
        correct, eff_acc = simulate_model_output(template, sample["review"], sample["label"], penalty)
        correct_count += int(correct)
        per_sample_acc.append(eff_acc)
    
    accuracy = correct_count / len(VALIDATION_SET)
    apo_results[name] = {"accuracy": accuracy, "mean_eff_acc": np.mean(per_sample_acc)}
    print(f"  {name}")
    print(f"    → Simulated accuracy: {accuracy:.0%}  |  Mean effective acc: {np.mean(per_sample_acc):.2%}\n")
 
best_name = max(apo_results, key=lambda k: apo_results[k]["accuracy"])
print(f"✓ APO SELECTED: {best_name}")
print(f"  Accuracy: {apo_results[best_name]['accuracy']:.0%}")

Visualising the APO Results

names    = [n.split("--")[0].strip() for n in apo_results.keys()]
accs     = [v["accuracy"] for v in apo_results.values()]
best_idx = accs.index(max(accs))
 
colors = ["#3B6FE0" if i == best_idx else "#CBD5E8" for i in range(len(accs))]
 
fig, ax = plt.subplots(figsize=(10, 5))
fig.patch.set_facecolor("#FAFAF8")
ax.set_facecolor("#FAFAF8")
 
bars = ax.bar(names, accs, color=colors, alpha=0.9, width=0.55)
ax.set_ylim(0, 1.0)
ax.set_ylabel("Simulated Validation Accuracy", fontsize=10)
ax.set_title("APO: Which Prompt Template Wins?", fontsize=13, fontweight="bold", pad=12)
ax.spines[["top", "right"]].set_visible(False)
ax.grid(axis="y", alpha=0.25)
 
for bar, acc in zip(bars, accs):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.015,
            f"{acc:.0%}", ha="center", fontsize=11, fontweight="bold",
            color="#3B6FE0" if bar.get_facecolor()[0] < 0.5 else "#555")
 
ax.text(bars[best_idx].get_x() + bars[best_idx].get_width()/2,
        accs[best_idx] + 0.06, "★ APO Pick", ha="center",
        fontsize=10, color="#3B6FE0", fontweight="bold")
 
plt.tight_layout(pad=2)
plt.savefig("apo_results.png", dpi=150, bbox_inches="tight", facecolor="#FAFAF8")
plt.show()
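
The last step in fixing tokenization drift is to freeze the winning template and route every request through a single formatting function, so stray whitespace never sneaks back in. A minimal sketch under those assumptions (the build_prompt helper and the whitespace normalization are our additions, not part of the pipeline above):

import re
 
BEST_TEMPLATE = CANDIDATE_PROMPTS[best_name]  # the APO-selected template
 
def build_prompt(review):
    """Single entry point for prompt construction: normalizes only the
    variable part of the input, then applies the frozen template."""
    review = re.sub(r"\s+", " ", review).strip()  # collapse stray whitespace
    return BEST_TEMPLATE.format(review=review)
 
print(build_prompt("  The product exceeded\nall my expectations.  "))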
