Focal Loss vs Binary Cross-Entropy: A Practical Guide for Imbalanced Classification
Binary cross-entropy (BCE) is the default loss function for binary classification, but it breaks down badly on imbalanced datasets. The reason is subtle but important: BCE weighs errors from both classes equally, even when one class is extremely rare.
Imagine two predictions: a minority-class sample with true label 1 predicted at 0.3, and a majority-class sample with true label 0 predicted at 0.7. Both produce the same BCE value: −log(0.3). But should these two errors be treated equally? In an imbalanced dataset, definitely not: the error on the minority sample is far more costly.
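To make that concrete, recall that BCE(y, p) = −[y·log(p) + (1 − y)·log(1 − p)]. The minority sample contributes −log(0.3) ≈ 1.20, and the majority sample contributes −log(1 − 0.7) = −log(0.3) ≈ 1.20: identical penalties for errors with very different costs.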
This is exactly where Focal Loss comes in. It reduces the contribution of easy, confident predictions and amplifies the impact of hard, minority-class examples. As a result, the model focuses less on the overwhelmingly easy majority class and more on the patterns that actually matter.
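For reference, Focal Loss (Lin et al., 2017) wraps the cross-entropy term in a modulating factor:

FL(p_t) = −α · (1 − p_t)^γ · log(p_t)

where p_t is the predicted probability of the true class. For easy examples p_t is close to 1, so (1 − p_t)^γ drives their loss toward zero; with γ = 0 and α = 1 the expression reduces to plain cross-entropy. (In the original paper the weight is class-dependent, α for positives and 1 − α for negatives; the implementation later in this post applies a single α to every sample.)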

In this tutorial, we demonstrate this effect by training two identical neural networks on a dataset with a 99:1 imbalance ratio, one using BCE and the other using Focal Loss, and comparing their behavior, decision regions, and confusion matrices.
Installing the dependencies
pip install numpy pandas matplotlib scikit-learn torch
Creating an Imbalanced Dataset
We create a synthetic binary classification dataset of 6,000 samples with a 99:1 imbalance using make_classification. This ensures that most samples belong to the majority class, making it an ideal setup for demonstrating why BCE struggles and how Focal Loss helps.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import torch
import torch.nn as nn
import torch.optim as optim
# Generate imbalanced dataset
X, y = make_classification(
    n_samples=6000,
    n_features=2,
    n_redundant=0,
    n_clusters_per_class=1,
    weights=[0.99, 0.01],
    class_sep=1.5,
    random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
X_train = torch.tensor(X_train, dtype=torch.float32)
y_train = torch.tensor(y_train, dtype=torch.float32).unsqueeze(1)
X_test = torch.tensor(X_test, dtype=torch.float32)
y_test = torch.tensor(y_test, dtype=torch.float32).unsqueeze(1)
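As a quick sanity check (a small addition, not in the original script), you can confirm the class ratio before training:

# Expect roughly 99% class 0 and 1% class 1
print("Class counts:", np.bincount(y))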
Creating the Neural Network
We define a simple neural network with two hidden layers to keep the experiment lightweight and focused on the loss functions. This small architecture is sufficient to learn the decision boundary in our 2D dataset while clearly highlighting the differences between BCE and Focal Loss.
class SimpleNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(2, 16),
            nn.ReLU(),
            nn.Linear(16, 8),
            nn.ReLU(),
            nn.Linear(8, 1),
            nn.Sigmoid()
        )

    def forward(self, x):
        return self.layers(x)
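As a quick smoke test (not part of the original tutorial), you can verify that the network maps a batch of 2D points to one probability per sample:

# Four random points in, four sigmoid outputs out
print(SimpleNN()(torch.randn(4, 2)).shape)  # torch.Size([4, 1])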

Focal Loss Implementation
This class implements the Focal Loss function, which modifies binary cross-entropy by down-weighting easy examples and focusing training on hard, misclassified samples. The gamma term controls how aggressively easy samples are suppressed, while alpha provides an additional weighting term (applied per class in the original paper, and uniformly here). Together, they help the model learn better on imbalanced datasets.

class FocalLoss(nn.Module):
    def __init__(self, alpha=0.25, gamma=2):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma

    def forward(self, preds, targets):
        # Clamp predictions to avoid log(0)
        eps = 1e-7
        preds = torch.clamp(preds, eps, 1 - eps)
        # pt: predicted probability of the true class
        pt = torch.where(targets == 1, preds, 1 - preds)
        loss = -self.alpha * (1 - pt) ** self.gamma * torch.log(pt)
        return loss.mean()
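To see the down-weighting in action, here is a small numeric check with made-up probabilities (not part of the original tutorial):

fl = FocalLoss(alpha=0.25, gamma=2)
target = torch.tensor([[1.0]])
easy = torch.tensor([[0.95]])   # confident, correct prediction
hard = torch.tensor([[0.30]])   # misclassified positive
print(fl(easy, target).item())  # ~3e-5: (1 - 0.95)^2 crushes -log(0.95)
print(fl(hard, target).item())  # ~0.15: (1 - 0.30)^2 keeps most of -log(0.30)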
Training the Model
We define a simple training loop that optimizes the model with the chosen loss function and evaluates accuracy on the test set. We then train two identical neural networks, one with standard BCE loss and the other with Focal Loss, allowing us to directly compare how each loss function performs on the same imbalanced dataset. The printed accuracies highlight the performance gap between BCE and Focal Loss.
Although BCE shows a very high accuracy (98%), this is misleading because the dataset is heavily imbalanced: predicting almost everything as the majority class still yields high accuracy. Focal Loss, on the other hand, improves minority-class detection, which is why its slightly higher accuracy (99%) is far more meaningful in this context.
def train(model, loss_fn, lr=0.01, epochs=30):
    opt = optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        preds = model(X_train)
        loss = loss_fn(preds, y_train)
        opt.zero_grad()
        loss.backward()
        opt.step()
    # Evaluate accuracy on the held-out test set
    with torch.no_grad():
        test_preds = model(X_test)
        test_acc = ((test_preds > 0.5).float() == y_test).float().mean().item()
    return test_acc, test_preds.squeeze().detach().numpy()
# Models
model_bce = SimpleNN()
model_focal = SimpleNN()
acc_bce, preds_bce = train(model_bce, nn.BCELoss())
acc_focal, preds_focal = train(model_focal, FocalLoss(alpha=0.25, gamma=2))
print("Test Accuracy (BCE):", acc_bce)
print("Test Accuracy (Focal Loss):", acc_focal)
Plotting the Decision Boundary
The BCE model produces an almost flat decision boundary that predicts only the majority class, completely ignoring the minority samples. This happens because, in an imbalanced dataset, BCE is dominated by majority-class examples and learns to classify nearly everything as that class. In contrast, the Focal Loss model shows a much more refined and meaningful decision boundary, successfully identifying more minority-class regions and capturing patterns BCE fails to learn.
def plot_decision_boundary(model, title):
    # Create a grid covering the feature space
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(
        np.linspace(x_min, x_max, 300),
        np.linspace(y_min, y_max, 300)
    )
    grid = torch.tensor(np.c_[xx.ravel(), yy.ravel()], dtype=torch.float32)
    with torch.no_grad():
        Z = model(grid).reshape(xx.shape)
    # Shade the predicted probability regions and overlay the data
    plt.contourf(xx, yy, Z, levels=[0, 0.5, 1], alpha=0.4)
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap='coolwarm', s=10)
    plt.title(title)
    plt.show()
plot_decision_boundary(model_bce, "Decision Boundary -- BCE Loss")
plot_decision_boundary(model_focal, "Decision Boundary -- Focal Loss")


Plotting the Confusion Matrix
In the BCE model's confusion matrix, the network correctly identifies only one minority-class sample, while misclassifying 27 of them as the majority class. This shows that BCE collapses toward predicting almost everything as the majority class due to the imbalance. In contrast, the Focal Loss model correctly predicts 14 minority samples, cutting the misclassifications from 27 down to 14. This demonstrates how Focal Loss places more emphasis on hard, minority-class examples, enabling the model to learn a decision boundary that actually captures the rare class instead of ignoring it.
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

def plot_conf_matrix(y_true, y_pred, title):
    cm = confusion_matrix(y_true, y_pred)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm)
    disp.plot(cmap="Blues", values_format='d')
    plt.title(title)
    plt.show()
# Convert torch tensors to numpy
y_test_np = y_test.numpy().astype(int)
preds_bce_label = (preds_bce > 0.5).astype(int)
preds_focal_label = (preds_focal > 0.5).astype(int)
plot_conf_matrix(y_test_np, preds_bce_label, "Confusion Matrix -- BCE Loss")
plot_conf_matrix(y_test_np, preds_focal_label, "Confusion Matrix -- Focal Loss")


