Modern language models are trained on data with extremely uneven token distributions. A small number of words appear in almost every sentence, while many rare but meaningful tokens occur only occasionally. This creates a hidden optimization problem: parameters associated with frequent tokens receive constant gradient updates, while parameters tied to rare tokens may go hundreds or thousands of steps without receiving any meaningful signal. Under standard Stochastic Gradient Descent (SGD), every parameter uses the same learning rate, so frequently updated weights converge quickly while rare-token weights often remain close to their random initialization.

This is where Adam’s adaptive optimization becomes important. While Adam is often described as SGD with momentum, its most impactful feature in practice is variance normalization. Adam tracks gradient statistics for each parameter independently and automatically scales each update by a running estimate of that parameter’s squared gradient magnitude. Parameters that rarely receive gradients end up with proportionally larger effective learning rates, allowing underrepresented features to learn much faster than they would under vanilla SGD.

To demonstrate this behavior concretely, we build a controlled NumPy experiment using a six-token vocabulary whose frequencies span roughly three orders of magnitude, from tokens appearing in nearly every sample (0.95) to tokens appearing only 0.1% of the time. We train the same linear model twice, once with SGD and once with Adam, while keeping all target weights identical. By comparing final parameter values, non-zero gradient counts, and Adam’s effective learning rates for each token, we can directly observe how adaptive optimization compensates for frequency imbalance.

Setting up the dependencies

We begin by constructing a deliberately simplified training environment that isolates a single factor: token frequency. The vocabulary contains six tokens ranging from extremely frequent words like “the” to very rare tokens like “thalweg,” with appearance probabilities spanning roughly three orders of magnitude. Every token is assigned the same ground-truth importance (the correct weight for all tokens is set to 1.0), so the experiment removes semantic complexity and focuses entirely on how often each parameter receives gradient updates.

Each training sample is represented as a sparse binary vector indicating which tokens are present in that sample. The target value is simply the sum of the active token weights plus a small amount of Gaussian noise. We then train a small linear model on this synthetic dataset. Because a token contributes to the gradient only when it appears in at least one sample of the batch, rare tokens naturally receive far fewer updates than frequent ones. This setup creates a clean environment for observing how SGD and Adam behave under highly imbalanced gradient exposure.

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec

np.random.seed(42)

TOKENS = ["the", "model", "embedding", "tokenization", "xenobiotic", "thalweg"]
# Appearance probability per sample -- spans roughly three orders of magnitude
FREQ   = np.array([0.95,   0.60,    0.20,          0.05,          0.005,       0.001])
TRUE_W = np.ones(6)   # all weights should reach 1.0

N_STEPS   = 3000
LR        = 0.05
BATCH_SIZE = 32     # samples per step


def sample_batch(batch_size):
    """
    Each sample is a sparse binary feature vector.
    Token i appears in the sample with probability FREQ[i].
    Target y = x @ TRUE_W + small noise.
    """
    X = (np.random.rand(batch_size, 6) < FREQ).astype(float)
    y = X @ TRUE_W + np.random.randn(batch_size) * 0.1
    return X, y

SGD

We first train the model using standard mini-batch SGD. The model weights are initialized to zero, and at every training step we sample a batch, compute the prediction error, calculate the average gradient across the batch, and update the weights using a fixed learning rate. The implementation also records the full weight trajectory over time along with the number of steps in which each parameter received a non-zero gradient.

The key behavior emerges from the sparsity of the input vectors. A token only contributes to the gradient when it appears in the sampled batch. For frequent tokens, this happens almost every step, so their associated weights receive frequent updates and converge quickly. Rare tokens, however, are absent from most batches, so their gradients are exactly zero for long stretches of training. Even when a rare token is present, it usually appears in only one or two of the 32 samples, so the batch-averaged gradient is small. As a result, SGD spends most of its optimization effort on high-frequency tokens while low-frequency tokens barely move from initialization.

def train_sgd(n_steps, lr, batch_size):
    w           = np.zeros(6)
    history     = np.zeros((n_steps, 6))   # weight trajectory per token
    grad_counts = np.zeros(6)              # how many non-zero gradients each weight received

    for t in range(n_steps):
        X, y  = sample_batch(batch_size)
        error = X @ w - y                    # prediction residual per sample
        grad  = (X.T @ error) / batch_size   # gradient of 0.5 * mean squared error
        w    -= lr * grad

        grad_counts += (np.abs(grad) > 1e-9).astype(float)
        history[t]   = w.copy()

    return history, grad_counts
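
The slow drift of rare-token weights can be estimated on paper. As a rough sketch, assume the other weights have already converged and the noise averages out, so the expected gradient for a token with per-sample probability p is approximately p * (w - 1). Under that assumption each step shrinks the remaining error by a factor of (1 - lr * p). The helper name `expected_sgd_weight` is introduced here for illustration only and is not part of the original experiment.

```python
import numpy as np

def expected_sgd_weight(p, lr, n_steps, w_true=1.0):
    """Approximate expected weight after n_steps of SGD, starting from zero.

    Assumption (illustrative only): the expected gradient is p * (w - w_true),
    i.e. cross-terms with other tokens and label noise are ignored.
    """
    p = np.asarray(p, dtype=float)
    remaining = (1.0 - lr * p) ** n_steps   # fraction of the initial error left
    return w_true * (1.0 - remaining)

freq = np.array([0.95, 0.60, 0.20, 0.05, 0.005, 0.001])
estimate = expected_sgd_weight(freq, lr=0.05, n_steps=3000)
for p, w in zip(freq, estimate):
    print(f"p={p:<6} expected w after 3000 steps ~ {w:.3f}")
```

Under these assumptions the two rarest tokens come out at roughly 0.53 and 0.14 after 3,000 steps, which suggests the shortfall is mostly a matter of too few effective updates rather than anything specific to the random seed.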

Adam

We now train the same model using Adam to observe how adaptive optimization changes the learning dynamics. Alongside the model weights, Adam maintains two additional running statistics for every parameter: a first-moment (momentum) estimate m, which is an exponential moving average of past gradients, and a second-moment estimate v, which is an exponential moving average of squared gradients. Before applying updates, both statistics are bias-corrected to account for their initialization at zero.

def train_adam(n_steps, lr, batch_size, beta1=0.9, beta2=0.999, eps=1e-8):
    w         = np.zeros(6)
    m         = np.zeros(6)                # first-moment (momentum) estimate
    v         = np.zeros(6)                # second-moment (squared-gradient) estimate
    history   = np.zeros((n_steps, 6))
    v_history = np.zeros((n_steps, 6))     # track variance accumulation

    for t in range(1, n_steps + 1):        # t starts at 1 so bias correction is defined
        X, y   = sample_batch(batch_size)
        error  = X @ w - y
        grad   = (X.T @ error) / batch_size

        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad ** 2

        # Bias correction for the zero initialization of m and v
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)

        w -= lr * m_hat / (np.sqrt(v_hat) + eps)

        history[t-1]   = w.copy()
        v_history[t-1] = v_hat.copy()

    return history, v_history
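
To isolate the normalization effect from the rest of the experiment, the following sketch feeds one parameter a hypothetical gradient stream that is zero except for a small value on every 100th step, and adds up how far SGD and Adam would each move it. The function name `total_movement` and the stream itself are assumptions made for illustration. The gradient is held fixed rather than recomputed from the weight, so this shows step sizes only, not convergence.

```python
import numpy as np

def total_movement(grads, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the summed |update| applied by SGD and by Adam to one parameter."""
    m = v = 0.0
    sgd_total = adam_total = 0.0
    for t, g in enumerate(grads, start=1):
        # Plain SGD: the step is proportional to the raw gradient.
        sgd_total += abs(lr * g)
        # Adam: same moment updates and bias correction as train_adam.
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        adam_total += abs(lr * m_hat / (np.sqrt(v_hat) + eps))
    return sgd_total, adam_total

# Hypothetical stream: a gradient of 0.01 on every 100th step, zero otherwise.
sparse = np.zeros(1000)
sparse[99::100] = 0.01

sgd_move, adam_move = total_movement(sparse)
print(f"SGD  total movement: {sgd_move:.4f}")
print(f"Adam total movement: {adam_move:.4f}")
```

Because Adam divides by the square root of a second-moment estimate that stays tiny for a sparse stream, its total movement comes out far larger than SGD's for the same gradients.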

Running both

With both optimizers implemented, we train the model twice under identical conditions, once using SGD and once using Adam. Each optimizer sees the same synthetic data distribution and uses the same initialization, learning rate, batch size, and training duration. This ensures that any difference in the final behavior comes from the optimization strategy itself rather than from changes in the dataset or model architecture.

print("Training SGD...")
sgd_history, sgd_grad_counts = train_sgd(N_STEPS, LR, BATCH_SIZE)

print("Training Adam...")
adam_history, adam_v_history = train_adam(N_STEPS, LR, BATCH_SIZE)

print()
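
One caveat about the run above: both training loops draw from NumPy's global random state, so they see different batches sampled from the same distribution. If identical batches are wanted for a stricter comparison, one option is to give each run its own seeded generator. The sketch below assumes a hypothetical `sample_batch_rng` variant of `sample_batch`; it is not part of the original code.

```python
import numpy as np

FREQ = np.array([0.95, 0.60, 0.20, 0.05, 0.005, 0.001])
TRUE_W = np.ones(6)

def sample_batch_rng(rng, batch_size):
    """Same sampling scheme as sample_batch, driven by an explicit Generator."""
    X = (rng.random((batch_size, 6)) < FREQ).astype(float)
    y = X @ TRUE_W + rng.standard_normal(batch_size) * 0.1
    return X, y

# Two generators with the same seed produce the same batch sequence.
rng_sgd = np.random.default_rng(42)
rng_adam = np.random.default_rng(42)

X_a, y_a = sample_batch_rng(rng_sgd, 32)
X_b, y_b = sample_batch_rng(rng_adam, 32)
print(np.array_equal(X_a, X_b) and np.array_equal(y_a, y_b))
```

Passing the generator into each training function would then make the two optimizers see exactly the same data at every step.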

Measuring the failure

We now evaluate how well each optimizer learned the token weights after training. Since every token has the same true target weight of 1.0, the ideal outcome is that all learned weights also end close to 1.0 regardless of token frequency. Along with the final weights, we also measure in how many training steps each token actually received a non-zero gradient. This lets us directly compare optimization quality against gradient exposure.

The results clearly show the difference between SGD and Adam. For frequent tokens, both optimizers learn the correct weights because these tokens appear in almost every batch. For rare tokens, SGD struggles badly. “xenobiotic” receives gradients in only about 15% of training steps and its weight stalls around 0.53 instead of 1.0. The rarest token, “thalweg,” receives gradients in only 3.4% of steps and SGD barely learns it at all, ending near 0.15. Adam, however, brings both rare-token weights close to the correct value despite receiving the same sparse gradient signals.

sgd_final  = sgd_history[-1]
adam_final = adam_history[-1]

print("=" * 62)
print(f"{'Token':<16} {'Freq':>6}  {'SGD w':>8}  {'Adam w':>8}  {'SGD grads':>10}")
print("-" * 62)
for i, token in enumerate(TOKENS):
    sgd_err  = abs(sgd_final[i]  - TRUE_W[i])
    adam_err = abs(adam_final[i] - TRUE_W[i])
    flag = "  ← fails" if sgd_err > 0.3 else ""
    print(
        f"{token:<16} {FREQ[i]:>6.3f}  {sgd_final[i]:>8.4f}  "
        f"{adam_final[i]:>8.4f}  {int(sgd_grad_counts[i]):>10}{flag}"
    )
print()
print(f"True weight for all tokens: {TRUE_W[0]:.1f}")
print()

# In how many steps did each token get a non-zero gradient?
print("Non-zero gradient steps out of", N_STEPS)
for i, token in enumerate(TOKENS):
    pct = sgd_grad_counts[i] / N_STEPS * 100
    bar = "█" * int(pct / 2)
    print(f"  {token:<16} {bar:<50} {pct:.1f}%")

print()
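
The share of steps with a non-zero gradient is higher than the per-sample frequency because a token only needs to appear in one of the batch's samples. As a quick check, assuming samples are drawn independently, the expected share is 1 - (1 - p)^B for per-sample probability p and batch size B:

```python
import numpy as np

freq = np.array([0.95, 0.60, 0.20, 0.05, 0.005, 0.001])
batch_size = 32

# Probability that a token shows up in at least one of the batch's samples,
# assuming the samples are independent draws.
p_in_batch = 1.0 - (1.0 - freq) ** batch_size

for p, q in zip(freq, p_in_batch):
    print(f"per-sample p={p:<6} -> P(non-zero gradient in a step) ~ {q:.3f}")
```

For the two rarest tokens this gives roughly 0.15 and 0.03, in line with the measured gradient counts.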

Effective Learning Rate

To understand why Adam succeeds on rare tokens, we examine its effective learning rate for each parameter at the end of training. Adam does not use the same update scale for every weight. Instead, each parameter’s update is divided by the square root of its bias-corrected second-moment estimate v̂. The effective learning rate reported here is lr / (√v̂ + ε), the factor that multiplies the bias-corrected momentum, so it depends on how large or small that estimate has become during training.

The numbers reveal a clear pattern. Common tokens such as “the” and “model” accumulate large variance values because they receive gradients almost every step, so their effective learning rates remain comparatively small. Rare tokens behave very differently. Since “xenobiotic” and “thalweg” receive gradients only occasionally, their variance estimates stay tiny, causing Adam to automatically amplify their effective learning rates by a large factor. Even though the nominal learning rate is fixed at 0.05, the rarest token ends up with an effective learning rate above 40. This adaptive scaling is the core reason Adam can learn sparse parameters that SGD fails to optimize properly.

eps = 1e-8
adam_v_final    = adam_v_history[-1]
effective_lr    = LR / (np.sqrt(adam_v_final) + eps)

print("=" * 55)
print("Adam Effective Learning Rate (final step)")
print("=" * 55)
for i, token in enumerate(TOKENS):
    print(f"  {token:<16}  v_hat={adam_v_final[i]:.6f}  lr_eff={effective_lr[i]:.4f}")
print()
print(f"Nominal LR: {LR}")
print("Rare tokens get an automatically amplified effective LR.")
print()
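
A large effective learning rate does not by itself mean a large weight change, because it multiplies a momentum term that is small whenever the gradients are small. The sketch below uses an idealized steady state (an assumption: the gradient has been a constant g long enough that m̂ ≈ g and v̂ ≈ g²) to show the two quantities side by side. The helper `adam_steady_state` is hypothetical and only meant to illustrate the scaling.

```python
import numpy as np

def adam_steady_state(g, lr=0.05, eps=1e-8):
    """Effective LR and actual step for a constant gradient g.

    Idealized assumption: m_hat has settled at g and v_hat at g**2.
    """
    m_hat, v_hat = g, g ** 2
    lr_eff = lr / (np.sqrt(v_hat) + eps)   # factor applied to the momentum
    step = lr_eff * m_hat                  # size of the weight update
    return lr_eff, step

for g in (1.0, 1e-3):
    lr_eff, step = adam_steady_state(g)
    print(f"g={g:<6} lr_eff={lr_eff:.4f}  step={step:.6f}")
```

In this idealized case the step stays close to the nominal learning rate whatever the gradient scale, while the effective learning rate grows as the gradient shrinks. That is the sense in which Adam equalizes progress across parameters.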

Visualizing the Results

Finally, we visualize the full training dynamics to compare how SGD and Adam behave across tokens with vastly different frequencies. Two of the plots track the weight trajectories during training, showing whether each optimizer can move rare-token parameters toward the correct value. We also compare the final weight errors for every token to measure overall learning quality.

The four charts tell a single story across two optimizers. The top-left shows SGD’s weight trajectories: frequent tokens (dark and medium blue) shoot up to 1.0 within the first few hundred steps, while the two rare tokens, xenobiotic and thalweg, barely leave the floor, crawling to 0.53 and 0.15 respectively after all 3,000 steps. The top-right bar chart makes the damage concrete: SGD’s error bars for xenobiotic and thalweg dwarf everything else, while Adam’s blue bars stay uniformly small across all six tokens.

The bottom-left shows Adam’s trajectories: all six tokens converge to 1.0, including the rare ones, though with more oscillation because each rare gradient update carries a large amplified step. The bottom-right explains why: plotted on a log-log scale, the relationship between token frequency and Adam’s effective learning rate is a clean inverse. The token thalweg sits at the top-left with a 41× amplified effective LR, “the” sits at the bottom-right near the nominal 0.05, and every other token falls on the same diagonal. Adam did not receive any special instructions about which tokens were rare; the variance term computed it automatically from gradient history alone.

BG   = "#fafaf8"
DARK = "#1a1a1a"

# Color ramp: blue for frequent tokens, red for rare
TOKEN_COLORS = ["#1a5276", "#2471a3", "#5dade2", "#e67e22", "#c0392b", "#7d2a2a"]

steps = np.arange(N_STEPS)

fig = plt.figure(figsize=(16, 11), facecolor=BG)
fig.suptitle(
    "SGD vs. Adam on Rare Tokens -- Frequency Bias and Variance Normalization",
    fontsize=14, fontweight="bold", color=DARK, y=0.99
)

gs = gridspec.GridSpec(2, 3, figure=fig, hspace=0.45, wspace=0.35)

# ── 1. SGD weight trajectories ────────────────────────────────
ax1 = fig.add_subplot(gs[0, :2])
ax1.set_facecolor(BG)
ax1.axhline(1.0, color=DARK, lw=1, ls="--", alpha=0.3, label="True weight = 1.0")

for i, (token, color) in enumerate(zip(TOKENS, TOKEN_COLORS)):
    ax1.plot(steps, sgd_history[:, i], color=color, lw=1.8,
             label=f"{token} (freq={FREQ[i]:.3f})")

ax1.set_title("SGD -- Weight Trajectories\nRare tokens barely move from zero", fontsize=11, color=DARK)
ax1.set_xlabel("Training Step", fontsize=9)
ax1.set_ylabel("Learned Weight", fontsize=9)
ax1.legend(fontsize=8, loc="right")
ax1.set_ylim(-0.3, 1.6)
ax1.spines[["top", "right"]].set_visible(False)

# Annotate failure zone
ax1.annotate(
    "Rare tokens stuck\nnear zero",
    xy=(N_STEPS * 0.95, sgd_history[-1, 5]),
    xytext=(N_STEPS * 0.65, -0.15),
    fontsize=8.5, color="#c0392b",
    arrowprops=dict(arrowstyle="->", color="#c0392b", lw=1.2),
    bbox=dict(boxstyle="round,pad=0.3", facecolor="#fff0f0", edgecolor="#c0392b", alpha=0.85)
)

# ── 2. Final weight error bar chart ───────────────────────────
ax2 = fig.add_subplot(gs[0, 2])
ax2.set_facecolor(BG)

x      = np.arange(6)
w_sgd  = sgd_final
w_adam = adam_final
width  = 0.35

bars_sgd  = ax2.bar(x - width/2, np.abs(w_sgd  - TRUE_W), width, color="#c0392b", alpha=0.85, label="SGD error")
bars_adam = ax2.bar(x + width/2, np.abs(w_adam - TRUE_W), width, color="#2980b9", alpha=0.85, label="Adam error")

ax2.set_xticks(x)
ax2.set_xticklabels([t[:8] for t in TOKENS], rotation=30, ha="right", fontsize=8)
ax2.set_ylabel("|learned w − true w|", fontsize=9)
ax2.set_title("Final Weight Error\n(lower = better)", fontsize=11, color=DARK)
ax2.legend(fontsize=8)
ax2.spines[["top", "right"]].set_visible(False)

# ── 3. Adam weight trajectories ───────────────────────────────
ax3 = fig.add_subplot(gs[1, :2])
ax3.set_facecolor(BG)
ax3.axhline(1.0, color=DARK, lw=1, ls="--", alpha=0.3, label="True weight = 1.0")

for i, (token, color) in enumerate(zip(TOKENS, TOKEN_COLORS)):
    ax3.plot(steps, adam_history[:, i], color=color, lw=1.8,
             label=f"{token} (freq={FREQ[i]:.3f})")

ax3.set_title("Adam -- Weight Trajectories\nRare tokens converge via variance normalization", fontsize=11, color=DARK)
ax3.set_xlabel("Training Step", fontsize=9)
ax3.set_ylabel("Learned Weight", fontsize=9)
ax3.legend(fontsize=8, loc="right")
ax3.set_ylim(-0.3, 1.6)
ax3.spines[["top", "right"]].set_visible(False)

ax3.annotate(
    "Rare tokens converge\ndespite sparse gradients",
    xy=(N_STEPS * 0.95, adam_history[-1, 5]),
    xytext=(N_STEPS * 0.60, 0.3),
    fontsize=8.5, color="#27ae60",
    arrowprops=dict(arrowstyle="->", color="#27ae60", lw=1.2),
    bbox=dict(boxstyle="round,pad=0.3", facecolor="#f0fff4", edgecolor="#27ae60", alpha=0.85)
)

# ── 4. Effective LR vs frequency ─────────────────────────────
ax4 = fig.add_subplot(gs[1, 2])
ax4.set_facecolor(BG)

ax4.scatter(FREQ, effective_lr, c=TOKEN_COLORS, s=120, zorder=5, edgecolors="white", linewidths=1.5)
for i, token in enumerate(TOKENS):
    ax4.annotate(token, (FREQ[i], effective_lr[i]),
                 textcoords="offset points", xytext=(6, 4), fontsize=7.5, color=TOKEN_COLORS[i])

ax4.axhline(LR, color=DARK, lw=1, ls="--", alpha=0.4)
ax4.text(0.5, LR * 1.05, f"Nominal LR = {LR}", fontsize=8, color=DARK, alpha=0.6)

ax4.set_xscale("log")
ax4.set_yscale("log")
ax4.set_xlabel("Token Frequency (log scale)", fontsize=9)
ax4.set_ylabel("Adam Effective LR  lr/√v̂  (log scale)", fontsize=9)
ax4.set_title("Adam's Automatic Equalizer\nRare tokens get amplified LR", fontsize=11, color=DARK)
ax4.spines[["top", "right"]].set_visible(False)

plt.savefig("sgd_vs_adam.png", dpi=150, bbox_inches="tight", facecolor=BG)
plt.show()


The post Stochastic Gradient Descent (SGD’s) Frequency Bias and How Adam Fixes It appeared first on MarkTechPost.
