
Why Gradient Descent Zigzags and How Momentum Fixes It

Gradient descent has a basic limitation: on most real-world loss surfaces, it is inefficient. When the surface has uneven curvature (steep in one direction and flat in another, which is common in practice), the algorithm struggles to make consistent progress. A high learning rate speeds up movement along the flat direction but causes overshooting and oscillations along the steep direction. Lowering the learning rate stabilizes the updates but significantly slows convergence. This trade-off isn't rare; it's typical behavior for standard gradient descent.

Momentum addresses this issue by incorporating information from past gradients. Instead of relying solely on the current gradient, it maintains a running average (usually called velocity) and updates the parameters based on this accumulated direction. As a result, consistent gradients reinforce one another, allowing faster movement across flat regions, while oscillating gradients tend to cancel out, reducing instability.

In this article, we walk through exactly how this works: the update equations, and a from-scratch simulation on a controlled anisotropic surface that lets us measure the difference precisely (185 steps to convergence for vanilla GD versus 159 for Momentum, with β = 0.99 failing to converge at all).

Setting up the dependencies

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec

Defining the Loss Surface

The loss surface is a stretched bowl: flat along one axis, steep along the other. This is controlled by the two coefficients: 0.05 in x makes that direction nearly flat, while 5 in y makes it steep. The gradients reflect this directly: 0.1·x in the flat direction, 10·y in the steep one.

The Hessian of this surface is diagonal with eigenvalues 0.1 and 10, giving a condition number of 100. That number is the core of the problem: it tells you the surface is 100× more curved in one direction than the other, which is what forces GD into its zigzag behavior.

The learning rate of 0.18 is chosen deliberately. The stability limit for GD is 2 / λ_max = 2 / 10 = 0.2; any higher and the optimizer diverges outright. At 0.18, the steep-axis update factor is |1 − 10 × 0.18| = 0.8, meaning the optimizer overshoots and reverses direction every single step. The flat-axis factor is |1 − 0.1 × 0.18| = 0.982, meaning it recovers only 1.8% of the remaining distance per step. This is the worst-case combination Momentum is built for: oscillation in one direction, near-stagnation in the other.

def loss(x, y):
    return 0.05 * x**2 + 5 * y**2
 
def grad(x, y):
    return np.array([0.1 * x, 10 * y])
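
As a quick sanity check on those numbers, the per-axis update factor |1 − lr·λ| can be computed directly from the eigenvalues. The snippet below is a small illustrative check (not part of the main walkthrough); it assumes the learning rate of 0.18 used in the experiments later on.

LR_CHECK = 0.18                               # same learning rate the experiments use below
eigenvalues = np.array([0.1, 10.0])           # Hessian eigenvalues of the stretched bowl
factors = np.abs(1 - LR_CHECK * eigenvalues)  # per-step contraction factor along each axis

print("condition number :", eigenvalues.max() / eigenvalues.min())  # 100.0
print("flat-axis factor :", factors[0])   # 0.982 → only 1.8% progress per step
print("steep-axis factor:", factors[1])   # 0.8   → overshoots and flips sign every step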

Optimizers

Both methods follow the same overall process: start from an initial position, take a fixed number of steps, and track how the position changes. The key difference lies in how each step is computed.

Vanilla gradient descent is very simple. At each step, it updates the position by subtracting the gradient scaled by the learning rate. It doesn't remember anything from earlier steps. This is why oscillations occur: if the gradient direction keeps flipping (up, then down), the updates simply follow that pattern with no mechanism to smooth it out.

Momentum introduces one extra term: velocity (v), which starts at zero. Instead of using only the current gradient, it updates this velocity by combining the previous velocity and the new gradient. The parameter β controls how much weight is given to past information. A higher β (e.g., 0.9) means the update relies more on past gradients, while a lower β makes it behave more like standard gradient descent.

This averaging effect behaves differently across directions. In steep directions, where gradients frequently change sign, the updates tend to cancel one another out, reducing oscillations. In flatter directions, where gradients are more consistent, they accumulate over time, allowing the optimizer to move faster.

Finally, the position is updated using this velocity instead of the raw gradient. This results in smoother and more stable progress toward the minimum.

def gradient_descent(start, lr, steps=300):
    """
    Vanilla GD:  θ ← θ − lr · ∇L(θ)

    Each update depends only on the current gradient.
    No memory of past gradients -- oscillations persist.
    """
    path = [np.array(start, dtype=float)]
    pos  = np.array(start, dtype=float)
    for _ in range(steps):
        pos = pos - lr * grad(*pos)
        path.append(pos.copy())
    return np.array(path)
 
 
def momentum_gd(start, lr, beta, steps=300):
    """
    Momentum GD:
        v ← β·v + (1−β)·∇L(θ)
        θ ← θ − lr·v

    v is a weighted running average of past gradients (an exponential moving average).

    Why it helps:
      - In y: gradients alternate sign → they cancel in v → oscillations damped.
      - In x: gradients share the same sign → they accumulate in v → faster steps.

    β controls memory length. High β → longer memory → more smoothing (and risk
    of overshooting). Low β → shorter memory → closer to vanilla GD.
    """
    path = [np.array(start, dtype=float)]
    pos  = np.array(start, dtype=float)
    v    = np.zeros(2)
    for _ in range(steps):
        g   = grad(*pos)
        v   = beta * v + (1 - beta) * g
        pos = pos - lr * v
        path.append(pos.copy())
    return np.array(path)

Running all three scenarios

All three experiments start from the same point (−4.0, 1.5), use the same learning rate, and run for 300 steps. The only difference is whether momentum is used and the value of β. Instead of just recording the final position, the full trajectory is stored for each run, which allows us to analyze how the optimizer moves over time. Vanilla gradient descent progresses slowly with a zigzag pattern and reaches a final loss of 0.000015. Momentum with β = 0.90 performs more efficiently, reducing oscillations and building velocity in the right direction, ultimately reaching a lower loss of 0.000001 within the same number of steps.

However, momentum is sensitive to the choice of β. When β is set too high (e.g., 0.99), the optimizer accumulates excessive velocity with very little decay. This leads to overshooting the minimum and failing to stabilize, resulting in a much higher final loss of 0.487363 even after 300 steps. In this case, the optimizer effectively keeps circling the minimum without converging. These results highlight that while momentum can significantly improve convergence, it must be tuned carefully: too little offers no benefit over standard gradient descent, while too much introduces instability.

START = [-4.0, 1.5]
LR    = 0.18
STEPS = 300
 
path_gd        = gradient_descent(START, lr=LR,           steps=STEPS)
path_mom_good  = momentum_gd(START,      lr=LR, beta=0.90, steps=STEPS)
path_mom_large = momentum_gd(START,      lr=LR, beta=0.99, steps=STEPS)
 
print(f"Vanilla GD        -- remaining loss: {loss(*path_gd[-1]):.6f}")
print(f"Momentum β=0.90   -- remaining loss: {loss(*path_mom_good[-1]):.6f}")
print(f"Momentum β=0.99   -- remaining loss: {loss(*path_mom_large[-1]):.6f}  ← diverges")

Visualizing the Result

The visualization is split into two parts. The top row shows the first 55 steps of each optimizer plotted on a contour map of the loss surface, making it easier to observe the movement patterns without clutter. The bottom row displays the full 300-step loss curves on a log scale, allowing a clear comparison of convergence speed over the entire run.

From the contour plots, the behavior is immediately clear. Vanilla gradient descent oscillates heavily along one direction, making very slow progress toward the minimum. Momentum with β = 0.90 damps these oscillations and follows a smoother, more direct path. In contrast, β = 0.99 leads to persistent bouncing with almost no progress. The loss curves confirm this: vanilla GD decreases steadily but slowly, β = 0.90 converges faster and more efficiently, while β = 0.99 shows repeated spikes due to overshooting and fails to converge within the given number of steps.

PLOT_STEPS = 55
 
x_ = np.linspace(-5, 5, 500)
y_ = np.linspace(-2.2, 2.2, 500)
X, Y = np.meshgrid(x_, y_)
Z    = loss(X, Y)
 
fig = plt.figure(figsize=(16, 10), facecolor="#FAFAF8")
gs  = GridSpec(2, 3, figure=fig, hspace=0.45, wspace=0.38,
               left=0.07, right=0.97, top=0.88, bottom=0.08)
 
COLORS = {
    "gd":        "#E05C4B",
    "mom_good":  "#3A7CA5",
    "mom_large": "#F4A536",
    "contour":   "#D4C9B8",
    "minima":    "#2A9D5C",
    "start":     "#444444",
}
 
PANEL_TITLES = [
    "Vanilla Gradient Descent\nOscillates, slow  (185 steps to converge)",
    "Momentum  β = 0.90\nSmooth, fast  (159 steps to converge)",
    "Momentum  β = 0.99 (too large)\nOvershoots -- never converges",
]
 
paths_plot = [
    path_gd[:PLOT_STEPS+1],
    path_mom_good[:PLOT_STEPS+1],
    path_mom_large[:PLOT_STEPS+1],
]
colors = [COLORS["gd"], COLORS["mom_good"], COLORS["mom_large"]]
 
# top row: trajectory panels
for col, (path, color, title) in enumerate(zip(paths_plot, colors, PANEL_TITLES)):
    ax = fig.add_subplot(gs[0, col])
    ax.set_facecolor("#F5F3EE")

    levels = np.geomspace(0.005, 3.5, 28)
    ax.contour(X, Y, Z, levels=levels, colors=COLORS["contour"],
               linewidths=0.7, alpha=0.9)

    ax.plot(path[:, 0], path[:, 1], color=color, lw=1.8, alpha=0.85, zorder=3)
    ax.scatter(path[:, 0], path[:, 1], color=color, s=18, zorder=4, alpha=0.6)

    ax.scatter(*path[0],  marker="o", s=90,  color=COLORS["start"],  zorder=5, label="start")
    ax.scatter(*path[-1], marker="*", s=120, color=COLORS["minima"], zorder=5, label="end")
    ax.scatter(0, 0, marker="+", s=200, color=COLORS["minima"], linewidths=2.5, zorder=6)

    ax.set_xlim(-5, 5)
    ax.set_ylim(-2.2, 2.2)
    ax.set_title(title, fontsize=9.5, fontweight="bold", color="#222", pad=7, loc="left")
    ax.set_xlabel("θ₁  (slow direction)", fontsize=8, color="#666")
    ax.set_ylabel("θ₂  (fast direction)", fontsize=8, color="#666")
    ax.tick_params(labelsize=7, colors="#888")
    for spine in ax.spines.values():
        spine.set_edgecolor("#CCCCCC")
 
# bottom-left: loss curves (full 300 steps)
ax_loss = fig.add_subplot(gs[1, :2])
ax_loss.set_facecolor("#F5F3EE")

full_paths  = [path_gd, path_mom_good, path_mom_large]
full_labels = ["Vanilla GD  (185 steps)", "Momentum β=0.90  (159 steps)", "Momentum β=0.99  (diverges)"]

for path, color, label in zip(full_paths, colors, full_labels):
    losses = [loss(*p) for p in path]
    steps_range = np.arange(len(path))
    ax_loss.plot(steps_range, losses, color=color, lw=2, label=label, alpha=0.9)

ax_loss.axhline(0.001, color="#999", lw=1, ls="--", alpha=0.6)
ax_loss.text(305, 0.001, "convergence\nthreshold", fontsize=7, color="#888", va="center")

ax_loss.set_yscale("log")
ax_loss.set_xlim(0, STEPS)
ax_loss.set_title("Loss vs. Optimisation Step  (log scale, 300 steps)",
                  fontsize=10.5, fontweight="bold", color="#222", loc="left")
ax_loss.set_xlabel("Step", fontsize=9, color="#666")
ax_loss.set_ylabel("Loss  f(θ)", fontsize=9, color="#666")
ax_loss.legend(fontsize=8.5, framealpha=0.6)
ax_loss.tick_params(labelsize=8, colors="#888")
for spine in ax_loss.spines.values():
    spine.set_edgecolor("#CCCCCC")
 
# bottom-right: annotation panel
ax_ann = fig.add_subplot(gs[1, 2])
ax_ann.set_facecolor("#F5F3EE")
ax_ann.axis("off")

annotation = (
    "Update rules\n\n"
    "Vanilla GD\n"
    "  θ ← θ − α·∇L(θ)\n\n"
    "Momentum GD\n"
    "  v ← β·v + (1−β)·∇L(θ)\n"
    "  θ ← θ − α·v\n\n"
    "Key intuition\n"
    "  v accumulates past gradients.\n"
    "  Vertical oscillations cancel out.\n"
    "  Horizontal steps compound.\n\n"
    "Hyperparameter β\n"
    "  β → 0  :  behaves like GD\n"
    "  β = 0.9:  typical sweet spot\n"
    "  β → 1  :  overshoots / diverges"
)
ax_ann.text(0.05, 0.97, annotation, transform=ax_ann.transAxes,
            fontsize=8.8, va="top", ha="left",
            fontfamily="monospace", color="#333", linespacing=1.7)
 
fig.suptitle("Momentum in Gradient Descent",
             fontsize=16, fontweight="bold", color="#111", y=0.95)

plt.savefig("momentum_explainer.png", dpi=150, bbox_inches="tight",
            facecolor=fig.get_facecolor())
plt.show()

β sensitivity sweep

The experiment runs the momentum optimizer several times with different β values, each for up to 500 steps. For every run, it checks when the loss first drops below 0.001 and records that step as the convergence point. β = 0 serves as a baseline, since it removes the effect of momentum and behaves exactly like vanilla gradient descent.

The results show a clear pattern. As β increases from 0.0 to 0.95, convergence steadily improves, with fewer steps needed each time. This happens because higher β values do a better job of smoothing out the oscillations and building useful momentum in the right direction. However, at β = 0.99, performance drops sharply. The optimizer becomes too slow to adjust because it relies too heavily on past gradients, leading to severe overshooting and delayed convergence. Overall, this creates an inverted-U relationship: moderate β values (around 0.9–0.95) give the best performance, while values that are too high can hurt convergence considerably.

THRESHOLD = 0.001
betas = [0.0, 0.5, 0.7, 0.85, 0.90, 0.95, 0.99]

print(f"\n── β Sensitivity  (steps to loss < {THRESHOLD}) ───────────────")
print(f"{'β':>6}  {'steps':>10}  note")
print("─" * 46)

for b in betas:
    path = momentum_gd(START, lr=LR, beta=b, steps=500)
    losses = [loss(*p) for p in path]
    hit = next((i for i, l in enumerate(losses) if l < THRESHOLD), None)
    note = ""
    if b == 0.0:  note = "← equivalent to vanilla GD"
    elif b == 0.90: note = "← typical sweet spot"
    elif b == 0.99: note = "← overshoots / diverges"
    status = f"{hit:>6} steps" if hit is not None else "  did not converge"
    print(f"{b:>6.2f}  {status}  {note}")
