
Sigmoid vs ReLU Activation Functions: The Inference Cost of Losing Geometric Context

A deep neural network can be understood as a geometric system, where each layer reshapes the input space to form increasingly complex decision boundaries. For this to work effectively, layers must preserve meaningful spatial information, in particular how far a data point lies from those boundaries, since this distance allows deeper layers to build rich, non-linear representations.

Sigmoid disrupts this process by compressing all inputs into a narrow range between 0 and 1. As values move away from decision boundaries, they become indistinguishable, causing a loss of geometric context across layers. This leads to weaker representations and limits the effectiveness of depth.

ReLU, on the other hand, preserves magnitude for positive inputs, allowing distance information to flow through the network. This lets deeper models remain expressive without requiring excessive width or compute.
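Before the full experiment, a quick standalone sketch (plain NumPy, with input values chosen purely for illustration) makes the compression concrete: two inputs that sit at very different distances from a boundary become nearly indistinguishable after Sigmoid, while ReLU keeps the gap intact.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def relu(z):
    return np.maximum(0, z)

# Two pre-activations at very different distances from a decision boundary
near, far = 6.0, 10.0
print(sigmoid(far) - sigmoid(near))  # gap shrinks to ~0.0024 after sigmoid
print(relu(far) - relu(near))        # gap of 4.0 is preserved exactly
```

This is the single-number version of the geometric argument; the two-moons experiment below shows the same effect layer by layer.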

In this article, we focus on this forward-pass behavior, examining how Sigmoid and ReLU differ in signal propagation and representation geometry using a two-moons experiment, and what that means for inference efficiency and scalability.

Setting up the dependencies

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from matplotlib.colors import ListedColormap
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
plt.rcParams.update({
    "font.family":        "monospace",
    "axes.spines.top":    False,
    "axes.spines.right":  False,
    "figure.facecolor":   "white",
    "axes.facecolor":     "#f7f7f7",
    "axes.grid":          True,
    "grid.color":         "#e0e0e0",
    "grid.linewidth":     0.6,
})
 
T = {                          
    "bg":      "white",
    "panel":   "#f7f7f7",
    "sig":     "#e05c5c",      
    "relu":    "#3a7bd5",      
    "c0":      "#f4a261",      
    "c1":      "#2a9d8f",      
    "text":    "#1a1a1a",
    "muted":   "#666666",
}

Creating the dataset

To study the effect of activation functions in a controlled setting, we first generate a synthetic dataset using scikit-learn’s make_moons. This creates a non-linear, two-class problem where simple linear boundaries fail, making it ideal for testing how well neural networks learn complex decision surfaces.

We add a small amount of noise to make the task more realistic, then standardize the features using StandardScaler so both dimensions are on the same scale, ensuring stable training. The dataset is then split into training and test sets to evaluate generalization.

Finally, we visualize the data distribution. This plot serves as the baseline geometry that both the Sigmoid and ReLU networks will attempt to model, allowing us to later compare how each activation function transforms this space across layers.

X, y = make_moons(n_samples=400, noise=0.18, random_state=42)
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

fig, ax = plt.subplots(figsize=(7, 5))
fig.patch.set_facecolor(T["bg"])
ax.set_facecolor(T["panel"])
ax.scatter(X[y == 0, 0], X[y == 0, 1], c=T["c0"], s=40,
           edgecolors="white", linewidths=0.5, label="Class 0", alpha=0.9)
ax.scatter(X[y == 1, 0], X[y == 1, 1], c=T["c1"], s=40,
           edgecolors="white", linewidths=0.5, label="Class 1", alpha=0.9)
ax.set_title("make_moons -- our dataset", color=T["text"], fontsize=13)
ax.set_xlabel("x₁", color=T["muted"]); ax.set_ylabel("x₂", color=T["muted"])
ax.tick_params(colors=T["muted"]); ax.legend(fontsize=10)
plt.tight_layout()
plt.savefig("moons_dataset.png", dpi=140, bbox_inches="tight")
plt.show()

Creating the Network

Next, we implement a small, controlled neural network to isolate the effect of activation functions. The goal here is not to build a highly optimized model, but to create a clean experimental setup where Sigmoid and ReLU can be compared under identical conditions.

We define both activation functions (Sigmoid and ReLU) along with their derivatives, and use binary cross-entropy as the loss since this is a binary classification task. The TwoLayerNet class represents a simple 3-layer feedforward network (2 hidden layers + output), where the only configurable component is the activation function.

A key detail is the initialization strategy: we use He initialization for ReLU and Xavier initialization for Sigmoid, ensuring that each network starts in a fair and stable regime based on its activation dynamics.
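As a rough sanity check of that claim (a standalone sketch with an arbitrary width of 256 and five layers, not code from the class below), stacking ReLU layers shows that He scaling keeps the activation scale roughly constant with depth, while Xavier scaling halves the variance at every ReLU layer:

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in = 256
x = rng.standard_normal((2000, fan_in))

stds = {}
for name, scale in [("he", np.sqrt(2 / fan_in)),      # He: Var(W) = 2/fan_in
                    ("xavier", np.sqrt(1 / fan_in))]:  # Xavier: Var(W) = 1/fan_in
    h = x
    for _ in range(5):                       # five ReLU hidden layers
        W = rng.standard_normal((fan_in, fan_in)) * scale
        h = np.maximum(0, h @ W)             # ReLU activation
    stds[name] = h.std()

# He keeps the activation scale roughly constant; under ReLU,
# Xavier scaling halves the variance at every layer instead.
print(stds)
```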

The forward pass computes activations layer by layer, while the backward pass performs standard gradient descent updates. Importantly, we also include diagnostic methods like get_hidden and get_z_trace, which let us inspect how signals evolve across layers; this is crucial for analyzing how much geometric information is preserved or lost.

By keeping architecture, data, and training setup constant, this implementation ensures that any difference in performance or internal representations can be directly attributed to the activation function itself, setting the stage for a clear comparison of their impact on signal propagation and expressiveness.

def sigmoid(z):      return 1 / (1 + np.exp(-np.clip(z, -500, 500)))
def sigmoid_d(a):    return a * (1 - a)
def relu(z):         return np.maximum(0, z)
def relu_d(z):       return (z > 0).astype(float)
def bce(y, yhat):    return -np.mean(y * np.log(yhat + 1e-9) + (1 - y) * np.log(1 - yhat + 1e-9))

class TwoLayerNet:
    def __init__(self, activation="relu", seed=0):
        np.random.seed(seed)
        self.act_name = activation
        self.act  = relu    if activation == "relu" else sigmoid
        self.dact = relu_d  if activation == "relu" else sigmoid_d

        # He init for ReLU, Xavier for Sigmoid
        scale = lambda fan_in: np.sqrt(2 / fan_in) if activation == "relu" else np.sqrt(1 / fan_in)
        self.W1 = np.random.randn(2, 8)  * scale(2)
        self.b1 = np.zeros((1, 8))
        self.W2 = np.random.randn(8, 8)  * scale(8)
        self.b2 = np.zeros((1, 8))
        self.W3 = np.random.randn(8, 1)  * scale(8)
        self.b3 = np.zeros((1, 1))
        self.loss_history = []

    def forward(self, X, store=False):
        z1 = X  @ self.W1 + self.b1;  a1 = self.act(z1)
        z2 = a1 @ self.W2 + self.b2;  a2 = self.act(z2)
        z3 = a2 @ self.W3 + self.b3;  out = sigmoid(z3)
        if store:
            self._cache = (X, z1, a1, z2, a2, z3, out)
        return out

    def backward(self, lr=0.05):
        X, z1, a1, z2, a2, z3, out = self._cache
        n = X.shape[0]

        dout = (out - self.y_cache) / n
        dW3 = a2.T @ dout;  db3 = dout.sum(axis=0, keepdims=True)
        da2 = dout @ self.W3.T
        dz2 = da2 * (self.dact(z2) if self.act_name == "relu" else self.dact(a2))
        dW2 = a1.T @ dz2;  db2 = dz2.sum(axis=0, keepdims=True)
        da1 = dz2 @ self.W2.T
        dz1 = da1 * (self.dact(z1) if self.act_name == "relu" else self.dact(a1))
        dW1 = X.T  @ dz1;  db1 = dz1.sum(axis=0, keepdims=True)

        for p, g in [(self.W3,dW3),(self.b3,db3),(self.W2,dW2),
                     (self.b2,db2),(self.W1,dW1),(self.b1,db1)]:
            p -= lr * g

    def train_step(self, X, y, lr=0.05):
        self.y_cache = y.reshape(-1, 1)
        out = self.forward(X, store=True)
        loss = bce(self.y_cache, out)
        self.backward(lr)
        return loss

    def get_hidden(self, X, layer=1):
        """Return post-activation values for layer 1 or 2."""
        z1 = X @ self.W1 + self.b1;  a1 = self.act(z1)
        if layer == 1: return a1
        z2 = a1 @ self.W2 + self.b2; return self.act(z2)

    def get_z_trace(self, x_single):
        """Return mean |pre-| and |post-activation| magnitudes per stage for ONE sample."""
        z1 = x_single @ self.W1 + self.b1
        a1 = self.act(z1)
        z2 = a1 @ self.W2 + self.b2
        a2 = self.act(z2)
        z3 = a2 @ self.W3 + self.b3
        return [np.abs(z1).mean(), np.abs(a1).mean(),
                np.abs(z2).mean(), np.abs(a2).mean(),
                np.abs(z3).mean()]

Training the Networks

Now we train both networks under identical conditions to ensure a fair comparison. We initialize two models, one using Sigmoid and the other using ReLU, with the same random seed so they start from equal weight configurations.

The training loop runs for 800 epochs using mini-batch gradient descent. In each epoch, we shuffle the training data, split it into batches, and update both networks in parallel. This setup ensures that the only variable changing between the two runs is the activation function.

We also track the loss after every epoch and log it at regular intervals. This lets us watch how each network evolves over time, not just in terms of convergence speed, but whether it continues improving or plateaus.

This step is essential because it establishes the first sign of divergence: if both models start identically but behave differently during training, that difference must come from how each activation function propagates and preserves information through the network.

EPOCHS = 800
LR     = 0.05
BATCH  = 64

net_sig  = TwoLayerNet("sigmoid", seed=42)
net_relu = TwoLayerNet("relu",    seed=42)

for epoch in range(EPOCHS):
    idx = np.random.permutation(len(X_train))
    for net in [net_sig, net_relu]:
        epoch_loss = []
        for i in range(0, len(idx), BATCH):
            b = idx[i:i+BATCH]
            loss = net.train_step(X_train[b], y_train[b], LR)
            epoch_loss.append(loss)
        net.loss_history.append(np.mean(epoch_loss))

    if (epoch + 1) % 200 == 0:
        ls = net_sig.loss_history[-1]
        lr = net_relu.loss_history[-1]
        print(f"  Epoch {epoch+1:4d} | Sigmoid loss: {ls:.4f} | ReLU loss: {lr:.4f}")

print("\n✅ Training complete.")

Training Loss Curve

The loss curves make the divergence between Sigmoid and ReLU very clear. Both networks start from the same initialization and are trained under identical conditions, yet their learning trajectories quickly separate. Sigmoid improves initially but plateaus around ~0.28 by epoch 400, showing almost no progress afterward, a sign that the network has exhausted the useful signal it can extract.

ReLU, in contrast, continues to steadily reduce loss throughout training, dropping from ~0.15 to ~0.03 by epoch 800. This isn’t just faster convergence; it reflects a deeper issue: Sigmoid’s compression limits the flow of meaningful information, causing the model to stall, while ReLU preserves that signal, allowing the network to keep refining its decision boundary.
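One standard way to see why the Sigmoid curve plateaus (an illustrative sketch, separate from the experiment's code): the sigmoid derivative σ′(z) = σ(z)(1 − σ(z)) is bounded by 0.25, so during backpropagation the activation term alone attenuates the gradient by at least a factor of 4 per layer, whereas ReLU's derivative is exactly 1 on active units.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# The sigmoid derivative sigma'(z) = sigma(z) * (1 - sigma(z)) peaks at 0.25
z = np.linspace(-10, 10, 10001)
sig_grad = sigmoid(z) * (1 - sigmoid(z))
print(sig_grad.max())            # 0.25, attained at z = 0

# Through depth layers, this factor alone scales the gradient by <= 0.25**depth
for depth in (2, 4, 8):
    print(depth, 0.25 ** depth)  # 0.0625, ~0.0039, ~1.5e-05
# ReLU's derivative is 1 on active units, so no such attenuation occurs.
```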

fig, ax = plt.subplots(figsize=(10, 5))
fig.patch.set_facecolor(T["bg"])
ax.set_facecolor(T["panel"])

ax.plot(net_sig.loss_history,  color=T["sig"],  lw=2.5, label="Sigmoid")
ax.plot(net_relu.loss_history, color=T["relu"], lw=2.5, label="ReLU")

ax.set_xlabel("Epoch", color=T["muted"])
ax.set_ylabel("Binary Cross-Entropy Loss", color=T["muted"])
ax.set_title("Training Loss -- same architecture, same init, same LR\nonly the activation differs",
             color=T["text"], fontsize=12)
ax.legend(fontsize=11)
ax.tick_params(colors=T["muted"])

# Annotate final losses
for net, color, va in [(net_sig, T["sig"], "bottom"), (net_relu, T["relu"], "top")]:
    final = net.loss_history[-1]
    ax.annotate(f"  final: {final:.4f}", xy=(EPOCHS-1, final),
                color=color, fontsize=9, va=va)

plt.tight_layout()
plt.savefig("loss_curves.png", dpi=140, bbox_inches="tight")
plt.show()

Decision Boundary Plots

The decision boundary visualization makes the difference even more tangible. The Sigmoid network learns a nearly linear boundary, failing to capture the curved structure of the two-moons dataset, which results in lower accuracy (~79%). This is a direct consequence of its compressed internal representations: the network simply doesn’t have enough geometric signal to construct a complex boundary.

In contrast, the ReLU network learns a highly non-linear, well-adapted boundary that closely follows the data distribution, achieving much higher accuracy (~96%). Because ReLU preserves magnitude across layers, it allows the network to progressively bend and refine the decision surface, turning depth into actual expressive power rather than wasted capacity.

def plot_boundary(ax, net, X, y, title, color):
    h = 0.025
    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    grid = np.c_[xx.ravel(), yy.ravel()]
    Z = net.forward(grid).reshape(xx.shape)

    # Soft shading
    cmap_bg = ListedColormap(["#fde8c8", "#c8ece9"])
    ax.contourf(xx, yy, Z, levels=50, cmap=cmap_bg, alpha=0.85)
    ax.contour(xx, yy, Z, levels=[0.5], colors=[color], linewidths=2)

    ax.scatter(X[y==0, 0], X[y==0, 1], c=T["c0"], s=35,
               edgecolors="white", linewidths=0.4, alpha=0.9)
    ax.scatter(X[y==1, 0], X[y==1, 1], c=T["c1"], s=35,
               edgecolors="white", linewidths=0.4, alpha=0.9)

    acc = ((net.forward(X) >= 0.5).ravel() == y).mean()
    ax.set_title(f"{title}\nTest acc: {acc:.1%}", color=color, fontsize=12)
    ax.set_xlabel("x₁", color=T["muted"]); ax.set_ylabel("x₂", color=T["muted"])
    ax.tick_params(colors=T["muted"])

fig, axes = plt.subplots(1, 2, figsize=(13, 5.5))
fig.patch.set_facecolor(T["bg"])
fig.suptitle("Decision Boundaries learned on make_moons",
             fontsize=13, color=T["text"])

plot_boundary(axes[0], net_sig,  X_test, y_test, "Sigmoid", T["sig"])
plot_boundary(axes[1], net_relu, X_test, y_test, "ReLU",    T["relu"])

plt.tight_layout()
plt.savefig("decision_boundaries.png", dpi=140, bbox_inches="tight")
plt.show()

Layer-by-Layer Signal Trace

This chart tracks how the signal evolves across layers for a point far from the decision boundary, and it clearly shows where Sigmoid fails. Both networks start with a similar pre-activation magnitude at the first layer (~2.0), but Sigmoid immediately compresses it to ~0.3, while ReLU keeps a higher value. As we move deeper, Sigmoid continues to squash the signal into a narrow band (0.5–0.6), effectively erasing meaningful differences. ReLU, on the other hand, preserves and amplifies magnitude, with the final layer reaching values as high as 9–20.

This means the output neuron in the ReLU network is making decisions based on a strong, well-separated signal, while the Sigmoid network is forced to classify using a weak, compressed one. The key takeaway is that ReLU preserves distance from the decision boundary across layers, allowing that information to compound, while Sigmoid progressively destroys it.

far_class0 = X_train[y_train == 0][np.argmax(
    np.linalg.norm(X_train[y_train == 0] - [-1.2, -0.3], axis=1)
)]
far_class1 = X_train[y_train == 1][np.argmax(
    np.linalg.norm(X_train[y_train == 1] - [1.2, 0.3], axis=1)
)]

stage_labels = ["z₁ (pre)", "a₁ (post)", "z₂ (pre)", "a₂ (post)", "z₃ (out)"]
x_pos = np.arange(len(stage_labels))

fig, axes = plt.subplots(1, 2, figsize=(13, 5.5))
fig.patch.set_facecolor(T["bg"])
fig.suptitle("Layer-by-layer signal magnitude -- a point far from the boundary",
             fontsize=12, color=T["text"])

for ax, sample, title in zip(
    axes,
    [far_class0, far_class1],
    ["Class 0 sample (deep in its moon)", "Class 1 sample (deep in its moon)"]
):
    ax.set_facecolor(T["panel"])
    sig_trace  = net_sig.get_z_trace(sample.reshape(1, -1))
    relu_trace = net_relu.get_z_trace(sample.reshape(1, -1))

    ax.plot(x_pos, sig_trace,  "o-", color=T["sig"],  lw=2.5, markersize=8, label="Sigmoid")
    ax.plot(x_pos, relu_trace, "s-", color=T["relu"], lw=2.5, markersize=8, label="ReLU")

    for i, (s, r) in enumerate(zip(sig_trace, relu_trace)):
        ax.text(i, s - 0.06, f"{s:.3f}", ha="center", fontsize=8, color=T["sig"])
        ax.text(i, r + 0.04, f"{r:.3f}", ha="center", fontsize=8, color=T["relu"])

    ax.set_xticks(x_pos); ax.set_xticklabels(stage_labels, color=T["muted"], fontsize=9)
    ax.set_ylabel("Mean |activation|", color=T["muted"])
    ax.set_title(title, color=T["text"], fontsize=11)
    ax.tick_params(colors=T["muted"]); ax.legend(fontsize=10)

plt.tight_layout()
plt.savefig("signal_trace.png", dpi=140, bbox_inches="tight")
plt.show()

Hidden Space Scatter

This is an important visualization because it directly exposes how each network uses (or fails to use) depth. In the Sigmoid network (left), both classes collapse into a tight, overlapping region, a diagonal smear where points are heavily entangled. The standard deviation actually decreases from layer 1 (0.26) to layer 2 (0.19), meaning the representation is becoming less expressive with depth. Each layer compresses the signal further, stripping away the spatial structure needed to separate the classes.

ReLU shows the opposite behavior. In layer 1, while some neurons are inactive (the “dead zone”), the active ones already spread across a wider range (1.15 std), indicating preserved variation. By layer 2, this expands even further (1.67 std), and the classes become clearly separable: one is pushed to high activation ranges while the other stays near zero. At this point, the output layer’s job is trivial.

fig, axes = plt.subplots(2, 2, figsize=(13, 10))
fig.patch.set_facecolor(T["bg"])
fig.suptitle("Hidden-space representations on make_moons test set",
             fontsize=13, color=T["text"])

for col, (net, color, title) in enumerate([
    (net_sig,  T["sig"],  "Sigmoid"),
    (net_relu, T["relu"], "ReLU"),
]):
    for row, layer in enumerate([1, 2]):
        ax = axes[row][col]
        ax.set_facecolor(T["panel"])
        H = net.get_hidden(X_test, layer=layer)

        ax.scatter(H[y_test==0, 0], H[y_test==0, 1], c=T["c0"], s=40,
                   edgecolors="white", linewidths=0.4, alpha=0.85, label="Class 0")
        ax.scatter(H[y_test==1, 0], H[y_test==1, 1], c=T["c1"], s=40,
                   edgecolors="white", linewidths=0.4, alpha=0.85, label="Class 1")

        spread = H.std()
        ax.text(0.04, 0.96, f"std: {spread:.4f}",
                transform=ax.transAxes, fontsize=9, va="top",
                color=T["text"],
                bbox=dict(boxstyle="round,pad=0.3", fc="white", ec=color, alpha=0.85))

        ax.set_title(f"{title}  --  Layer {layer} hidden space",
                     color=color, fontsize=11)
        ax.set_xlabel("Unit 1", color=T["muted"])
        ax.set_ylabel("Unit 2", color=T["muted"])
        ax.tick_params(colors=T["muted"])
        if row == 0 and col == 0: ax.legend(fontsize=9)

plt.tight_layout()
plt.savefig("hidden_space.png", dpi=140, bbox_inches="tight")
plt.show()



The post Sigmoid vs ReLU Activation Functions: The Inference Cost of Losing Geometric Context appeared first on MarkTechPost.
