Stochastic Gradient Descent (SGD’s) Frequency Bias and How Adam Fixes It
Modern language models are trained on data with extremely uneven token distributions. A small number of words appear in almost every sentence, while many rare but meaningful tokens occur only occasionally. This creates a hidden optimization problem: parameters associated with frequent tokens receive constant gradient updates, while parameters tied to rare tokens may go hundreds or thousands of steps without receiving any meaningful signal. Under standard Stochastic Gradient Descent (SGD), every parameter uses the same learning rate, so frequently updated weights converge quickly while rare-token weights often remain close to their random initialization.
This is where Adam’s adaptive optimization becomes important. While Adam is often described as SGD with momentum, its most impactful feature in practice is variance normalization. Adam tracks historical gradient statistics for each parameter independently and automatically adjusts update sizes based on how often reliable gradient information has been observed. Parameters that rarely receive updates end up with proportionally larger effective learning rates, allowing underrepresented features to learn much faster than they would under vanilla SGD.
To demonstrate this behavior concretely, we build a controlled NumPy experiment using a six-token vocabulary spanning roughly three orders of magnitude in frequency, from tokens appearing in 95% of samples to tokens appearing in only 0.1%. We train the same linear model twice, once with SGD and once with Adam, while keeping all target weights identical. By comparing final parameter values, non-zero gradient counts, and Adam’s effective learning rates for each token, we can directly observe how adaptive optimization compensates for frequency imbalance in actual training dynamics.


Setting up the dependencies
We begin by constructing a deliberately simplified training environment that isolates a single factor: token frequency. The vocabulary contains six tokens ranging from extremely common words like “the” to very rare tokens like “thalweg,” with appearance probabilities spanning roughly three orders of magnitude (0.95 down to 0.001). Every token is assigned the same ground-truth importance (the correct weight for all tokens is set to 1.0), so the experiment removes semantic complexity and focuses entirely on how often each parameter receives gradient updates.
Each training sample is represented as a sparse binary vector indicating which tokens are present in that sample. The target value is simply the sum of the active token weights plus a small amount of noise. We then train a small linear model on this synthetic dataset. Because a token contributes a gradient only when it appears somewhere in the batch, rare tokens naturally receive far fewer updates than frequent ones. This setup creates a clean environment for observing how SGD and Adam behave under highly imbalanced gradient exposure.
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
np.random.seed(42)
TOKENS = ["the", "model", "embedding", "tokenization", "xenobiotic", "thalweg"]
# Per-sample appearance probability -- spans roughly 3 orders of magnitude
FREQ = np.array([0.95, 0.60, 0.20, 0.05, 0.005, 0.001])
TRUE_W = np.ones(6)  # all weights should reach 1.0
N_STEPS = 3000
LR = 0.05
BATCH_SIZE = 32  # samples per step

def sample_batch(batch_size):
    """
    Each sample is a sparse binary feature vector.
    Token i appears in the sample with probability FREQ[i].
    Target y = x @ TRUE_W + small noise.
    """
    X = (np.random.rand(batch_size, 6) < FREQ).astype(float)
    y = X @ TRUE_W + np.random.randn(batch_size) * 0.1
    return X, y
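The per-sample probabilities translate into per-batch gradient exposure in a simple way. The following is a minimal sketch, assuming samples within a batch are drawn independently; the helper name `batch_exposure` is illustrative and not part of the original script.

```python
import numpy as np

FREQ = np.array([0.95, 0.60, 0.20, 0.05, 0.005, 0.001])
BATCH_SIZE = 32


def batch_exposure(freq, batch_size):
    """Probability that a token appears at least once in a batch.

    Assumes independent samples, so the chance that a token is absent
    from all `batch_size` samples is (1 - freq) ** batch_size.
    """
    return 1.0 - (1.0 - freq) ** batch_size


exposure = batch_exposure(FREQ, BATCH_SIZE)
for f, e in zip(FREQ, exposure):
    print(f"per-sample p={f:<6} per-batch exposure={e:.3f}")
```

Under these assumptions, the two rarest tokens produce a non-zero gradient in roughly 15% and 3% of steps, while the four most frequent tokens show up in most or nearly all batches.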
SGD
We first train the model using standard mini-batch SGD. The model weights are initialized to zero, and at every training step we sample a batch, compute the prediction error, calculate the average gradient across the batch, and update the weights using a fixed learning rate. The implementation also records the full weight trajectory over time along with the number of steps in which each parameter received a non-zero gradient.
The key behavior emerges from the sparsity of the input vectors. A token only contributes to the gradient when it appears in the sampled batch. For frequent tokens, this happens almost every step, so their associated weights receive frequent updates and converge quickly. Rare tokens, however, are absent from most batches, so their gradients are exactly zero for long stretches of training. As a result, SGD spends most of its optimization effort on high-frequency tokens while low-frequency tokens barely move from initialization.
def train_sgd(n_steps, lr, batch_size):
    w = np.zeros(6)
    history = np.zeros((n_steps, 6))  # weight trajectory per token
    grad_counts = np.zeros(6)  # how many non-zero gradients each weight received
    for t in range(n_steps):
        X, y = sample_batch(batch_size)
        error = X @ w - y
        grad = (X.T @ error) / batch_size  # mean gradient over the batch
        w -= lr * grad
        grad_counts += (np.abs(grad) > 1e-9).astype(float)
        history[t] = w.copy()
    return history, grad_counts
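To make the sparsity effect concrete, here is a single SGD step on a tiny hand-written batch. This is only an illustrative sketch: the 4-sample, 3-token batch is hypothetical and the targets are noiseless for clarity.

```python
import numpy as np

# Hypothetical 4-sample batch over 3 tokens; the last token never appears.
X = np.array([
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 0.0],
])
true_w = np.ones(3)
y = X @ true_w            # noiseless targets, for clarity
w = np.zeros(3)           # same zero initialization as the experiment
lr = 0.05

error = X @ w - y                 # prediction error per sample
grad = (X.T @ error) / len(X)     # mean gradient over the batch
w_new = w - lr * grad

print("grad :", grad)
print("w_new:", w_new)
```

The column of zeros for the absent token zeroes out its gradient entry, so its weight is left untouched by this step while the other two weights move toward 1.0.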
Adam
We now train the same model using Adam to observe how adaptive optimization changes the learning dynamics. Alongside the model weights, Adam maintains two additional running statistics for every parameter: a momentum estimate m, which tracks the average direction of past gradients, and a variance estimate v, which tracks the average magnitude of squared gradients. Before applying updates, both statistics are bias-corrected to account for their initialization at zero.
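For reference, the standard Adam update for a gradient g_t at step t can be written as follows. The notation is an addition here: η stands for the learning rate (`lr` in the code) and ε for the small stabilizing constant (`eps`).

```latex
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^t}, \qquad
\hat{v}_t = \frac{v_t}{1 - \beta_2^t} \\
w_t &= w_{t-1} - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\end{aligned}
```

All operations are element-wise, so each weight carries its own m and v.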
def train_adam(n_steps, lr, batch_size, beta1=0.9, beta2=0.999, eps=1e-8):
    w = np.zeros(6)
    m = np.zeros(6)
    v = np.zeros(6)
    history = np.zeros((n_steps, 6))
    v_history = np.zeros((n_steps, 6))  # track variance accumulation
    for t in range(1, n_steps + 1):
        X, y = sample_batch(batch_size)
        error = X @ w - y
        grad = (X.T @ error) / batch_size
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad ** 2
        m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
        w -= lr * m_hat / (np.sqrt(v_hat) + eps)
        history[t-1] = w.copy()
        v_history[t-1] = v_hat.copy()
    return history, v_history
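One way to see the normalization at work is to look at the very first step. With m and v starting at zero, the bias-corrected statistics reduce to m_hat = g and v_hat = g², so the update is approximately lr · sign(g) regardless of the gradient's scale. The sketch below checks this for a scalar gradient; the helper name `adam_first_step` is illustrative only.

```python
import numpy as np


def adam_first_step(grad, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    """Size of the very first Adam update for a scalar gradient (m = v = 0)."""
    m = (1 - beta1) * grad
    v = (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1)      # bias correction at t = 1
    v_hat = v / (1 - beta2)
    return lr * m_hat / (np.sqrt(v_hat) + eps)


for g in (1e-4, 1.0, 100.0):
    print(f"grad={g:<8} first update={adam_first_step(g):.6f}")
```

Plain SGD would scale these three updates by six orders of magnitude; Adam's first step has nearly the same size for all of them.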
Running both
With both optimizers implemented, we train the model twice under identical conditions, once using SGD and once using Adam. Each optimizer sees the same synthetic data distribution and uses the same initialization, learning rate, batch size, and training duration. This ensures that any difference in the final behavior comes entirely from the optimization method itself rather than changes in the dataset or model architecture.
print("Training SGD...")
sgd_history, sgd_grad_counts = train_sgd(N_STEPS, LR, BATCH_SIZE)
print("Training Adam...")
adam_history, adam_v_history = train_adam(N_STEPS, LR, BATCH_SIZE)
print()
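Because both runs draw from the same global NumPy random stream, they see the same distribution but not the same sequence of batches. If identical batches are wanted, one option is to reseed before each run. This is an assumption about how one might tighten the comparison, not something the original script does; `sample_features` is a hypothetical stand-in for the feature-sampling step.

```python
import numpy as np

FREQ = np.array([0.95, 0.60, 0.20, 0.05, 0.005, 0.001])


def sample_features(batch_size):
    # Same sampling rule as the experiment's feature matrix.
    return (np.random.rand(batch_size, 6) < FREQ).astype(float)


np.random.seed(42)
X_first = sample_features(32)     # what the first optimizer would see

np.random.seed(42)                # reseed before the second run
X_second = sample_features(32)    # identical draw

print("identical batches:", np.array_equal(X_first, X_second))
```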
Measuring the failure
We now evaluate how well each optimizer learned the token weights after training. Since every token has the same true target weight of 1.0, the ideal outcome is that all learned weights also end close to 1.0 regardless of token frequency. Along with the final weights, we also measure in how many training steps each token actually received a non-zero gradient. This lets us directly compare optimization quality against gradient exposure frequency.
The results clearly show the difference between SGD and Adam. For frequent tokens, both optimizers learn the correct weights successfully because these tokens appear in almost every batch. But for rare tokens, SGD struggles badly. “xenobiotic” only receives gradients in about 15% of training steps and its weight stops around 0.53 instead of 1.0. The rarest token, “thalweg,” receives gradients in only 3.4% of steps and SGD barely learns it at all, ending near 0.15. Adam, however, keeps both rare-token weights close to the correct value despite receiving the same sparse gradient signals.
sgd_final = sgd_history[-1]
adam_final = adam_history[-1]
print("=" * 62)
print(f"{'Token':<16} {'Freq':>6} {'SGD w':>8} {'Adam w':>8} {'SGD grads':>10}")
print("-" * 62)
for i, token in enumerate(TOKENS):
    sgd_err = abs(sgd_final[i] - TRUE_W[i])
    adam_err = abs(adam_final[i] - TRUE_W[i])
    flag = " ← fails" if sgd_err > 0.3 else ""
    print(
        f"{token:<16} {FREQ[i]:>6.3f} {sgd_final[i]:>8.4f} "
        f"{adam_final[i]:>8.4f} {int(sgd_grad_counts[i]):>10}{flag}"
    )
print()
print(f"True weight for all tokens: {TRUE_W[0]:.1f}")
print()
# In how many steps did each token get a non-zero gradient?
print("Non-zero gradient steps out of", N_STEPS)
for i, token in enumerate(TOKENS):
    pct = sgd_grad_counts[i] / N_STEPS * 100
    bar = "█" * int(pct / 2)
    print(f" {token:<16} {bar:<50} {pct:.1f}%")
print()
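The SGD numbers can be sanity-checked with a back-of-the-envelope model. This is a rough sketch under a simplifying assumption: cross-token terms are ignored, so the expected gradient for token i is FREQ[i] · (w_i − 1). Each step then shrinks the remaining gap to the target by a factor (1 − LR · FREQ[i]), starting from w = 0.

```python
import numpy as np

FREQ = np.array([0.95, 0.60, 0.20, 0.05, 0.005, 0.001])
LR = 0.05
N_STEPS = 3000

# Assumption: each weight evolves independently in expectation,
# w_i(t) = 1 - (1 - LR * FREQ[i]) ** t, starting from zero.
expected_w = 1.0 - (1.0 - LR * FREQ) ** N_STEPS

for f, w in zip(FREQ, expected_w):
    print(f"freq={f:<6} expected SGD weight after {N_STEPS} steps: {w:.3f}")
```

Under this approximation the two rarest tokens land near 0.53 and 0.14, in line with the values reported above, which suggests the shortfall comes from the small expected step size rather than from anything specific to the random draw.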
Effective Learning Rate
To understand why Adam succeeds on rare tokens, we examine its effective learning rate for each parameter at the end of training. Adam does not use the same update scale for every weight. Instead, each parameter’s update is divided by the square root of its accumulated variance estimate v. This means the practical step size depends on how large or small that variance has become during training.
The numbers reveal a clear pattern. Common tokens such as “the” and “model” accumulate large variance values because they receive gradients almost every step, so their effective learning rates remain relatively small. Rare tokens behave very differently. Since “xenobiotic” and “thalweg” receive gradients only occasionally, their variance estimates stay tiny, causing Adam to automatically amplify their effective learning rates by a large factor. Even though the nominal learning rate is fixed at 0.05, the rarest token ends up receiving an effective step size above 40. This adaptive scaling is the core reason Adam can learn sparse parameters that SGD fails to optimize properly.
eps = 1e-8
adam_v_final = adam_v_history[-1]
effective_lr = LR / (np.sqrt(adam_v_final) + eps)
print("=" * 55)
print("Adam Effective Learning Rate (final step)")
print("=" * 55)
for i, token in enumerate(TOKENS):
    print(f" {token:<16} v_hat={adam_v_final[i]:.6f} lr_eff={effective_lr[i]:.4f}")
print()
print(f"Nominal LR: {LR}")
print("Rare tokens get an automatically amplified effective LR.")
print()
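A rough way to see why sparse gradients inflate the ratio lr / sqrt(v): if a parameter sees a gradient of fixed size g in a fraction p of steps and zero otherwise, the running average of squared gradients settles near p · g², so lr / sqrt(v) scales like 1 / sqrt(p). The sketch below checks this with a small simulation. The values of `g`, `probs`, and `n_steps` are hypothetical, chosen for illustration, and are not taken from the experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
beta2, lr, g = 0.999, 0.05, 0.01       # g: hypothetical fixed gradient size
probs = np.array([0.5, 0.05])          # hypothetical per-step gradient odds
n_steps = 60_000

v = np.zeros_like(probs)
v_sum = np.zeros_like(probs)
for t in range(n_steps):
    # Gradient is g when the token shows up in the step, otherwise 0.
    grad = g * (rng.random(probs.shape) < probs)
    v = beta2 * v + (1 - beta2) * grad ** 2
    if t >= n_steps // 2:              # average only after the warm-up phase
        v_sum += v

v_mean = v_sum / (n_steps - n_steps // 2)
predicted = probs * g ** 2             # E[grad^2] for an on/off gradient

print("simulated v :", v_mean)
print("predicted v :", predicted)
print("lr / sqrt(v):", lr / np.sqrt(v_mean))
```

Note that this ratio multiplies m_hat, which is itself small when gradients are sparse, so the actual per-step movement of a weight typically stays within a small multiple of the nominal learning rate even when lr / sqrt(v) is very large.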
Visualizing the Results
Finally, we visualize the full training dynamics to compare how SGD and Adam behave across tokens with vastly different frequencies. The first two plots track the weight trajectories during training, showing whether each optimizer can move rare-token parameters toward the correct value. We also compare the final weight errors for every token to measure overall learning quality.
The four charts tell a single story across two optimizers. The top-left shows SGD’s weight trajectories: frequent tokens (dark and medium blue) shoot up to 1.0 within the first few hundred steps, while the two rare tokens, xenobiotic and thalweg, barely leave the floor, crawling to 0.53 and 0.15 respectively after all 3,000 steps. The top-right bar chart makes the damage concrete: SGD’s error bars for xenobiotic and thalweg dwarf everything else, while Adam’s blue bars stay uniformly small across all six tokens.
The bottom-left shows Adam’s trajectories: all six tokens converge to 1.0, including the rare ones, though with more oscillation because each rare gradient update carries a large amplified step. The bottom-right explains why: plotted on a log-log scale, the relationship between token frequency and Adam’s effective learning rate is a clean inverse. The rarest token, thalweg, sits at the top-left with a 41× amplified effective LR, “the” sits at the bottom-right near the nominal 0.05, and every other token falls on the same diagonal. Adam did not receive any special instructions about which tokens were rare; the variance term computed it automatically from gradient history alone.
BG = "#fafaf8"
DARK = "#1a1a1a"
# Color ramp: blue for frequent tokens, red for rare
TOKEN_COLORS = ["#1a5276", "#2471a3", "#5dade2", "#e67e22", "#c0392b", "#7d2a2a"]
steps = np.arange(N_STEPS)
fig = plt.figure(figsize=(16, 11), facecolor=BG)
fig.suptitle(
    "SGD vs. Adam on Rare Tokens -- Frequency Bias and Variance Normalization",
    fontsize=14, fontweight="bold", color=DARK, y=0.99
)
gs = gridspec.GridSpec(2, 3, figure=fig, hspace=0.45, wspace=0.35)
# ── 1. SGD weight trajectories ────────────────────────────────
ax1 = fig.add_subplot(gs[0, :2])
ax1.set_facecolor(BG)
ax1.axhline(1.0, color=DARK, lw=1, ls="--", alpha=0.3, label="True weight = 1.0")
for i, (token, color) in enumerate(zip(TOKENS, TOKEN_COLORS)):
    ax1.plot(steps, sgd_history[:, i], color=color, lw=1.8,
             label=f"{token} (freq={FREQ[i]:.3f})")
ax1.set_title("SGD -- Weight Trajectories\nRare tokens barely move from zero", fontsize=11, color=DARK)
ax1.set_xlabel("Training Step", fontsize=9)
ax1.set_ylabel("Learned Weight", fontsize=9)
ax1.legend(fontsize=8, loc="right")
ax1.set_ylim(-0.3, 1.6)
ax1.spines[["top", "right"]].set_visible(False)
# Annotate failure zone
ax1.annotate(
    "Rare tokens stuck\nnear zero",
    xy=(N_STEPS * 0.95, sgd_history[-1, 5]),
    xytext=(N_STEPS * 0.65, -0.15),
    fontsize=8.5, color="#c0392b",
    arrowprops=dict(arrowstyle="->", color="#c0392b", lw=1.2),
    bbox=dict(boxstyle="round,pad=0.3", facecolor="#fff0f0", edgecolor="#c0392b", alpha=0.85)
)
# ── 2. Final weight error bar chart ───────────────────────────
ax2 = fig.add_subplot(gs[0, 2])
ax2.set_facecolor(BG)
x = np.arange(6)
w_sgd = sgd_final
w_adam = adam_final
width = 0.35
bars_sgd = ax2.bar(x - width/2, np.abs(w_sgd - TRUE_W), width, color="#c0392b", alpha=0.85, label="SGD error")
bars_adam = ax2.bar(x + width/2, np.abs(w_adam - TRUE_W), width, color="#2980b9", alpha=0.85, label="Adam error")
ax2.set_xticks(x)
ax2.set_xticklabels([t[:8] for t in TOKENS], rotation=30, ha="right", fontsize=8)
ax2.set_ylabel("|learned w − true w|", fontsize=9)
ax2.set_title("Final Weight Error\n(lower = better)", fontsize=11, color=DARK)
ax2.legend(fontsize=8)
ax2.spines[["top", "right"]].set_visible(False)
# ── 3. Adam weight trajectories ───────────────────────────────
ax3 = fig.add_subplot(gs[1, :2])
ax3.set_facecolor(BG)
ax3.axhline(1.0, color=DARK, lw=1, ls="--", alpha=0.3, label="True weight = 1.0")
for i, (token, color) in enumerate(zip(TOKENS, TOKEN_COLORS)):
    ax3.plot(steps, adam_history[:, i], color=color, lw=1.8,
             label=f"{token} (freq={FREQ[i]:.3f})")
ax3.set_title("Adam -- Weight Trajectories\nRare tokens converge via variance normalization", fontsize=11, color=DARK)
ax3.set_xlabel("Training Step", fontsize=9)
ax3.set_ylabel("Learned Weight", fontsize=9)
ax3.legend(fontsize=8, loc="right")
ax3.set_ylim(-0.3, 1.6)
ax3.spines[["top", "right"]].set_visible(False)
ax3.annotate(
    "Rare tokens converge\ndespite sparse gradients",
    xy=(N_STEPS * 0.95, adam_history[-1, 5]),
    xytext=(N_STEPS * 0.60, 0.3),
    fontsize=8.5, color="#27ae60",
    arrowprops=dict(arrowstyle="->", color="#27ae60", lw=1.2),
    bbox=dict(boxstyle="round,pad=0.3", facecolor="#f0fff4", edgecolor="#27ae60", alpha=0.85)
)
# ── 4. Effective LR vs frequency ─────────────────────────────
ax4 = fig.add_subplot(gs[1, 2])
ax4.set_facecolor(BG)
ax4.scatter(FREQ, effective_lr, c=TOKEN_COLORS, s=120, zorder=5, edgecolors="white", lw=1.5)
for i, token in enumerate(TOKENS):
    ax4.annotate(token, (FREQ[i], effective_lr[i]),
                 textcoords="offset points", xytext=(6, 4), fontsize=7.5, color=TOKEN_COLORS[i])
ax4.axhline(LR, color=DARK, lw=1, ls="--", alpha=0.4)
ax4.text(0.5, LR * 1.05, f"Nominal LR = {LR}", fontsize=8, color=DARK, alpha=0.6)
ax4.set_xscale("log")
ax4.set_yscale("log")
ax4.set_xlabel("Token Frequency (log scale)", fontsize=9)
ax4.set_ylabel("Adam Effective LR lr/√v̂ (log scale)", fontsize=9)
ax4.set_title("Adam's Automatic Equalizer\nRare tokens get amplified LR", fontsize=11, color=DARK)
ax4.spines[["top", "right"]].set_visible(False)
plt.savefig("sgd_vs_adam.png", dpi=150, bbox_inches="tight", facecolor=BG)
plt.show()


The post Stochastic Gradient Descent (SGD’s) Frequency Bias and How Adam Fixes It appeared first on MarkTechPost.
