

🎓 69/167
This post is a part of the Deep learning basics educational series from my free course. Please keep in mind that the correct sequence of posts is outlined on the course page, while it can be arbitrary in Research.
I'm also happy to announce that I've started working on standalone paid courses, so you can support my work and get affordable educational material. These courses will be of a completely different quality, with more theoretical depth and a niche focus, and will feature challenging projects, quizzes, exercises, video lectures, and supplementary materials. Stay tuned!
Batch normalization is a technique that emerged in 2015 — introduced in the seminal paper "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift" (Ioffe and Szegedy, 2015) — and it radically changed the way we train modern neural networks. Essentially, the goal is to stabilize and speed up training by normalizing certain intermediate values within the neural network. Historically, data scientists and machine learning engineers had long recognized that normalizing input features (for instance, scaling each feature to have zero mean and unit variance) helps gradient-based methods converge more quickly and reliably. However, normalizing internal activations (i.e., hidden-layer activations during training) was not a common practice until batch normalization arrived on the scene.
One of the key motivations behind batch normalization is the issue of internal covariate shift. This term captures the notion that, as we train a deep network, the distribution of each layer's inputs can keep changing drastically due to updates of preceding layers' parameters. Because deeper layers continually see changing distributions, the network needs smaller learning rates and often more training steps to converge. The more layers we stack, the more amplified the effect can become — unless we do something to mitigate that shift.
Batch normalization significantly reduces these challenges. By normalizing (i.e., re-centering and re-scaling) intermediate layer activations in mini-batches, it ensures that each layer sees input data of a more stable distribution, thus alleviating the training difficulties caused by internal covariate shift. This practice enables faster convergence, permits higher learning rates, and even provides a helpful regularization effect. The normalization typically happens along the feature dimension; in the simplest case of a fully connected layer, the same kind of standardization is applied to each neuron's output across the batch.
In addition to speeding up convergence, batch normalization can also reduce the sensitivity to hyperparameter choices, make training deeper architectures more stable, and sometimes help the network generalize better. While subsequent research has refined the original concept (leading to variants like layer normalization, group normalization, instance normalization, and more), the fundamental principles of batch normalization remain relevant to a wide variety of deep learning tasks — ranging from image classification to natural language processing and beyond.
I have often found that practitioners new to deep learning do not realize how much of a performance bottleneck can be removed simply by integrating batch normalization layers in the right spots. It can be the difference between a model that saturates or explodes after a few epochs, and one that trains gracefully to state-of-the-art levels.
Given its importance, I think it is essential to dive deeper into the theory, discuss its derivation, illustrate the formulas that underlie it, and point out potential pitfalls. In this article, I will explore batch normalization from the ground up, covering:
- Where it came from and how it relates to data preprocessing
- The details of how it normalizes layer inputs
- The role of learnable parameters within the batch normalization transform
- How backpropagation flows through these normalizations
- How to implement batch normalization in popular frameworks like TensorFlow, Keras, and PyTorch
- Extensions and variations, such as layer normalization, instance normalization, group normalization, batch renormalization, and more
- Empirical demonstrations showing the speedup in training and improvements in final accuracy
With this foundation in mind, let us begin by defining batch normalization in more detail.
2. defining batch normalization
typical workflow in neural network training
In the usual pipeline for training neural networks, one typically:
- Preprocesses the training data (e.g., normalizing or standardizing each input feature so that it has zero mean and unit variance, or scaling data within [0, 1] or [-1, 1], etc.)
- Divides the training data into mini-batches (also called batches) of size $m$.
- Performs forward propagation for each batch, calculates the loss, and then computes gradients by backpropagation.
- Uses an optimization procedure (like stochastic gradient descent (SGD), possibly with momentum or adaptive methods) to update the parameters of the model.
Even though the raw input features are often normalized at the first step, nothing prevents the hidden layers from producing highly varying activations over the course of training. Once the parameters of the earlier layers update, the distribution of activations feeding into deeper layers can shift drastically. This phenomenon, known as internal covariate shift, is precisely what batch normalization addresses.
relation to data preprocessing techniques
In a sense, batch normalization is an automated internal data preprocessing scheme for each layer. Traditionally, you might do:
- Input standardization: $x' = \frac{x - \mu}{\sigma}$ for each input feature.
- Possibly apply principal component analysis or other transformations.
But with batch normalization, each hidden layer's input vector is likewise standardized to have zero mean and unit variance (in a mini-batch sense), and then re-scaled by a learned gain (often denoted $\gamma$) plus a learned shift (denoted $\beta$). These learned parameters allow the model to undo or modulate the normalization if that is optimal for the task. In other words, the network is not forced to keep everything strictly zero mean and unit variance; it can discover the best shift and scale for each activation dimension.
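For concreteness, here is a minimal NumPy sketch of the classic input standardization step mentioned above (array shapes are purely illustrative):

import numpy as np

# X: training data of shape (num_examples, num_features) -- illustrative
X = np.random.rand(1000, 20)

mu = X.mean(axis=0)            # per-feature mean
sigma = X.std(axis=0) + 1e-8   # per-feature std (small constant avoids division by zero)
X_standardized = (X - mu) / sigma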
conceptual overview
While I will get into the exact formulas in the next chapter, let me emphasize that batch normalization is effectively:
- Compute mean and variance of each feature dimension across a mini-batch.
- Subtract the batch mean from each example's feature dimension, then divide by the batch standard deviation.
- Multiply by a trainable scale $\gamma$ and add a trainable offset $\beta$.
During training, mean and variance are typically calculated per mini-batch. During inference (test time), one usually uses running estimates of mean and variance that were collected over many training batches.
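As a small illustration of that last point, here is a minimal NumPy sketch (names and the momentum value are illustrative, not tied to any particular framework) of how running estimates might be maintained during training and then used at inference:

import numpy as np

momentum = 0.9                  # illustrative value; exposed as a hyperparameter in frameworks
running_mean = np.zeros(128)    # one entry per activation dimension
running_var = np.ones(128)

def update_running_stats(batch):
    """batch: (batch_size, 128) activations from one training step."""
    global running_mean, running_var
    running_mean = momentum * running_mean + (1 - momentum) * batch.mean(axis=0)
    running_var = momentum * running_var + (1 - momentum) * batch.var(axis=0)

def normalize_at_inference(x, eps=1e-5):
    """Uses the stored running estimates instead of batch statistics."""
    return (x - running_mean) / np.sqrt(running_var + eps)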
This approach (learning $\gamma$ and $\beta$ for each activation dimension) solves a big worry in older networks, where if you normalized the hidden units, you might lose representational capacity. But with these trainable parameters, the network can adapt the normalized activations to exactly what is needed for a given layer's objective.
Batch normalization is not just a detail in the pipeline; it is widely recognized as a milestone in deep network training, enabling larger-scale, deeper, and more complex models to train stably.
3. mathematical foundations
normalization process
Let us define the forward-pass transformation done by batch normalization more rigorously. Suppose at some layer of the network, we have a mini-batch $\mathcal{B} = \{x^{(1)}, \dots, x^{(m)}\}$ of size $m$. For simplicity, I will denote each $x^{(i)}$ as a vector of dimension $d$ (these could be, for instance, the pre-activation outputs from the previous layer). Then each coordinate $x_k^{(i)}$ in $x^{(i)}$ is normalized as follows. First, we compute the per-dimension batch statistics:

$$\mu_{B,k} = \frac{1}{m}\sum_{i=1}^{m} x_k^{(i)}, \qquad \sigma_{B,k}^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x_k^{(i)} - \mu_{B,k}\right)^2$$

These are respectively the mean and variance of the $k$-th dimension across the batch. Then we normalize:

$$\hat{x}_k^{(i)} = \frac{x_k^{(i)} - \mu_{B,k}}{\sqrt{\sigma_{B,k}^2 + \epsilon}}$$

Here, $\epsilon$ is a small constant (like $10^{-5}$) for numerical stability to avoid division by zero.

After this standardization step, we apply a learned scale and shift:

$$y_k^{(i)} = \gamma_k \hat{x}_k^{(i)} + \beta_k$$

yielding the final output of batch normalization for the $k$-th coordinate. The parameters $\gamma_k$ and $\beta_k$ are learned jointly with the other parameters in the network (like weights and biases of other layers) by backpropagation. The result is a normalized output that has zero mean and unit variance across the mini-batch, re-scaled and shifted according to $\gamma$ and $\beta$.
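To make these formulas concrete, here is a minimal NumPy sketch of the training-time forward transform for one batch-normalized layer (shapes and names are illustrative):

import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """x: (m, d) mini-batch; gamma, beta: (d,) learnable scale and shift."""
    mu = x.mean(axis=0)                    # per-dimension batch mean
    var = x.var(axis=0)                    # per-dimension batch variance (computed with 1/m)
    x_hat = (x - mu) / np.sqrt(var + eps)  # standardize
    y = gamma * x_hat + beta               # learned scale and shift
    return y, x_hat, mu, var

# illustrative usage
x = np.random.randn(64, 100)
gamma, beta = np.ones(100), np.zeros(100)
y, x_hat, mu, var = batchnorm_forward(x, gamma, beta)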
normalized outputs and learnable parameters
The presence of $\gamma$ and $\beta$ is crucial. If we forced the layer to have mean zero and variance one, we might hamper its representational capacity. The network might need a different scale or offset in order to generate a certain intermediate representation. By letting the network learn these, batch normalization retains the benefits of having stable and well-behaved distributions, while preserving the representational power.
Note that $\beta$ can absorb the role of what used to be a bias term in many network layers. Often, we can omit the bias in a layer after using batch normalization because $\beta$ already plays that role.
backpropagation through batch normalization
Training a model with batch normalization means we need to backpropagate through the mini-batch statistics. For a single dimension (dropping the index $k$ for brevity) and a loss $L$, the partial derivatives can be summarized by:

$$\frac{\partial L}{\partial \hat{x}^{(i)}} = \frac{\partial L}{\partial y^{(i)}}\,\gamma$$

$$\frac{\partial L}{\partial \sigma_B^2} = \sum_{i=1}^{m}\frac{\partial L}{\partial \hat{x}^{(i)}}\left(x^{(i)} - \mu_B\right)\left(-\frac{1}{2}\right)\left(\sigma_B^2 + \epsilon\right)^{-3/2}$$

$$\frac{\partial L}{\partial \mu_B} = \sum_{i=1}^{m}\frac{\partial L}{\partial \hat{x}^{(i)}}\,\frac{-1}{\sqrt{\sigma_B^2 + \epsilon}} + \frac{\partial L}{\partial \sigma_B^2}\,\frac{1}{m}\sum_{i=1}^{m}\left(-2\right)\left(x^{(i)} - \mu_B\right)$$

$$\frac{\partial L}{\partial x^{(i)}} = \frac{\partial L}{\partial \hat{x}^{(i)}}\,\frac{1}{\sqrt{\sigma_B^2 + \epsilon}} + \frac{\partial L}{\partial \sigma_B^2}\,\frac{2\left(x^{(i)} - \mu_B\right)}{m} + \frac{\partial L}{\partial \mu_B}\,\frac{1}{m}$$

$$\frac{\partial L}{\partial \gamma} = \sum_{i=1}^{m}\frac{\partial L}{\partial y^{(i)}}\,\hat{x}^{(i)}, \qquad \frac{\partial L}{\partial \beta} = \sum_{i=1}^{m}\frac{\partial L}{\partial y^{(i)}}$$

Here, the notation implicitly refers to a specific dimension, but the concept generalizes across all dimensions in the vector $x^{(i)}$. As the mini-batch size grows, we get more robust estimates of $\mu_B$ and $\sigma_B^2$. At inference time, we replace $\mu_B$ and $\sigma_B^2$ by moving averages that were collected during training, so that we can run predictions on single examples or any batch size without having to rely on the current batch's statistics.
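For reference, here is a compact NumPy sketch of these gradients, written for a 2-D activation matrix of shape (m, d), which is an assumption made for clarity:

import numpy as np

def batchnorm_backward(dout, x, gamma, eps=1e-5):
    """dout: upstream gradient dL/dy of shape (m, d); returns dL/dx, dL/dgamma, dL/dbeta."""
    m = x.shape[0]
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    std = np.sqrt(var + eps)
    x_hat = (x - mu) / std

    dgamma = np.sum(dout * x_hat, axis=0)
    dbeta = np.sum(dout, axis=0)

    dx_hat = dout * gamma
    # the chain rule through mu and var collapses into one expression
    dx = (dx_hat - dx_hat.mean(axis=0) - x_hat * (dx_hat * x_hat).mean(axis=0)) / std
    return dx, dgamma, dbeta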
4. benefits of batch normalization
reducing internal covariate shift
As I have said, internal covariate shift occurs when updates to early layers cause the distribution of activations feeding into deeper layers to shift dramatically. Because the deeper layers always see a changing input distribution, they require extra care to train stably. By applying batch normalization at intermediate points, we keep the distribution of these inputs stable across training iterations, thereby reducing this internal covariate shift.
Some argue about whether the term "internal covariate shift" is an entirely accurate characterization. Regardless of the nuances, the method indeed reduces the volatility of internal hidden distributions, which in practice yields better training dynamics.
improved convergence speed
With batch normalization, networks often converge in fewer epochs. Each individual update is more stable, so you can use larger learning rates or train deeper networks without blowing up the gradients. This was a major impetus behind the mass adoption of batch normalization in the deep learning community.
regularization effects and reduced overfitting
Another pleasant side effect is that batch normalization can provide a regularizing effect. By making each sample's activation depend on other samples in the batch (through the mean and variance calculation), the network introduces a slight noise in the hidden activations for each training example. This effect is somewhat analogous to dropout — though not exactly the same — and can help reduce overfitting. In some cases, you might even reduce or remove other forms of regularization (like dropout) when you use batch normalization.
The presence of $\gamma$ and $\beta$ helps ensure that batch normalization does not harm expressivity. Instead, the normalization merely re-anchors the representation space in each layer. This extra flexibility is often crucial for achieving strong final performance.
more robust to initialization
A well-known pain point in neural networks is how to initialize parameters so that gradients neither explode nor vanish. Batch normalization can help reduce the reliance on very careful weight initialization. Because the hidden activations are re-centered and re-scaled, even if the initial weights produce higher or lower distributions than expected, the normalization step can help keep them in a manageable range, making training stable right from the start.
5. implementation
using batch normalization layers in popular frameworks
Modern deep learning frameworks make batch normalization easy to apply. Typically, you insert a batch normalization layer right after a linear or convolutional transform and before the nonlinearity (although in some architectures, it might come after the nonlinearity — this depends on convention and experimentation).
Broadly, the step is:
# Pseudocode structure
out = linear_layer(input, W, b)
out_bn = BatchNorm(out, gamma, beta, momentum, eps, is_training=True)
out_activation = ReLU(out_bn)  # for example
In a typical scenario, we let the framework handle the running average of batch means and variances. During inference, we set is_training=False, and the transform uses those running statistics.
tensorflow / keras
In TensorFlow or Keras, a batch normalization layer can be inserted as:
import tensorflow as tf

# This example uses the high-level Keras API in TF 2.x
def build_model():
    model = tf.keras.models.Sequential([
        tf.keras.layers.Dense(128, input_shape=(784,)),  # some dense layer
        tf.keras.layers.BatchNormalization(),            # BN layer
        tf.keras.layers.Activation('relu'),              # activation
        # ...
    ])
    return model
In Keras, by default, there are arguments like momentum, epsilon, and others that you can tweak:
- momentum: controls how the moving averages of mean and variance are computed over successive batches. A typical default is around 0.99 or 0.9.
- epsilon: a small value added to the variance for numerical stability, often 1e-3 or 1e-5.
- beta_initializer and gamma_initializer: for initializing the learnable parameters (a small configuration example follows this list).
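For instance, a BN layer with these arguments spelled out explicitly might look like the following; the values shown are simply Keras' documented defaults, so this is only an illustrative configuration:

import tensorflow as tf

bn = tf.keras.layers.BatchNormalization(
    momentum=0.99,               # moving-average factor for mean/variance
    epsilon=1e-3,                # added to the variance for numerical stability
    beta_initializer='zeros',    # initial shift (beta)
    gamma_initializer='ones'     # initial scale (gamma)
)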
pytorch
In PyTorch, you do:
import torch
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.fc = nn.Linear(784, 128)
        self.bn = nn.BatchNorm1d(128)  # BN for a fully connected layer
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.fc(x)
        x = self.bn(x)
        x = self.relu(x)
        return x
For convolutional layers, you would use nn.BatchNorm2d or
nn.BatchNorm3d, depending on dimensionality. The logic is the same, but in 2D or 3D form, the batch norm layer normalizes each channel map over the batch, width, and height (for images).
code snippets and walkthrough
Let me provide a somewhat more thorough example in PyTorch, demonstrating how to configure the hyperparameters:
import torch
import torch.nn as nn
import torch.optim as optim

class SimpleCNN(nn.Module):
    def __init__(self):
        super(SimpleCNN, self).__init__()
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3)
        # Note: in PyTorch, momentum is the weight given to the *new* batch statistic
        # (the default is 0.1), which is the opposite of the Keras convention.
        self.bn1 = nn.BatchNorm2d(32, eps=1e-5, momentum=0.9, affine=True)
        self.relu = nn.ReLU()
        self.conv2 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3)
        self.bn2 = nn.BatchNorm2d(64, eps=1e-5, momentum=0.9, affine=True)
        self.fc = nn.Linear(64 * 24 * 24, 10)  # for a 28x28 input (two 3x3 convs -> 24x24)

    def forward(self, x):
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.conv2(x)
        x = self.bn2(x)
        x = self.relu(x)
        # flatten
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        return x

def train_example(dataloader):
    model = SimpleCNN()
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    # Suppose the dataloader fetches MNIST images
    for epoch in range(10):
        for images, labels in dataloader:
            optimizer.zero_grad()
            outputs = model(images)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
    # During inference, PyTorch automatically uses the running averages:
    model.eval()
    # ...
Here, momentum is set explicitly to 0.9 (PyTorch's own default is 0.1, and in PyTorch momentum is the weight given to the newest batch statistic, which is the opposite of the Keras convention). Either way, PyTorch uses an exponential moving average to track the batch mean and variance of each channel. The parameters $\gamma$ and $\beta$ (present when affine=True) are also learned.
recommended default settings
Most frameworks set a default epsilon of around $10^{-5}$ or $10^{-3}$ and a default momentum near 0.9 or 0.99. Empirically, these defaults are usually good starting points; you typically do not need to tune them heavily.
practical considerations and best practices
- Placement in the architecture: Commonly, you apply batch normalization immediately before the activation function, for instance Linear/Conv → BN → ReLU. Some architectures, however, do it differently, like Linear/Conv → ReLU → BN. Empirically, the first approach is more standard, but best practice can vary.
- Remove biases in preceding layers: Since batch normalization adds a learnable offset ($\beta$), you do not strictly need a bias term in your linear or convolutional layers. Setting bias=False removes the redundancy (see the sketch after this list).
- Watch out for very small batch sizes: If your mini-batches are extremely small (e.g., 2 or 4 examples), the estimates of mean and variance become noisy, which can degrade performance. In such scenarios, alternative methods (layer normalization, group normalization, or batch renormalization) might yield more stable training.
- Momentum for running statistics: The momentum parameter of your BN layer affects how quickly the moving averages of means and variances adapt to new data. If you see poor performance at inference time, check whether your running stats have converged or whether you need a different momentum setting.
- Initialization: Usually $\gamma$ is initialized to 1 and $\beta$ to 0. This way, at the beginning of training, the layer does not transform the normalized values, letting the rest of the network behave as if no BN were present. Over time, BN adapts to the distribution of activations.
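Following up on the bias point above, a minimal PyTorch sketch of a convolution block whose bias is dropped because the subsequent BN layer already supplies a learnable offset:

import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1, bias=False),  # bias is redundant before BN
    nn.BatchNorm2d(64),                                       # beta provides the learnable offset
    nn.ReLU(inplace=True),
)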
choosing the right batch size
Batch normalization depends on the batch's mean and variance, so the batch size matters. If the batch is too small, the statistics might be unreliable. Large mini-batches can produce more stable estimates, but you might not always have the resources (GPU memory, for example) to use large batches. Typical batch sizes for BN might range from 16 to 256, though the sweet spot depends on your application and hardware constraints. If you need to train with very small batches or large-scale distributed training where each worker sees only a small portion of data, consider alternatives like group normalization or batch renormalization.
hyperparameters: gamma, beta, and momentum
- $\gamma$ (gamma) and $\beta$ (beta) are the learnable scale and shift. In PyTorch, these are exposed as the weight and bias attributes of the batch norm layer. In Keras or TensorFlow, they are explicitly called gamma and beta (a small inspection sketch follows this list).
- momentum is a hyperparameter used for the running averages. Despite the name, it is not the same as momentum in SGD; it is just a parameter that controls how quickly the statistics used at test time are updated.
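In PyTorch, for example, these quantities can be inspected directly on the layer; a small sketch:

import torch.nn as nn

bn = nn.BatchNorm1d(128)
print(bn.weight.shape)        # gamma, shape (128,), learnable
print(bn.bias.shape)          # beta, shape (128,), learnable
print(bn.running_mean.shape)  # running mean used at inference
print(bn.running_var.shape)   # running variance used at inference
print(bn.momentum)            # 0.1 by default under PyTorch's convention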
dealing with small batches or large-scale distributed training
When you do distributed training, each device sees only a subset of the mini-batch, so if you globally batch-normalize, you need synchronization across multiple devices. Many frameworks have sync batch norm modules that handle this by calculating global means and variances across devices at each step. This can be more expensive, but is often necessary. If your effective batch size per worker is small, again, you might consider group normalization or layer normalization.
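In PyTorch, one common pattern is to convert an existing model's BN layers to their synchronized counterparts; a minimal sketch (the actual cross-device synchronization only happens once torch.distributed has been initialized and the model runs under distributed training):

import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(),
)
# Replace every BatchNorm*d layer with SyncBatchNorm; statistics are then
# computed across all processes in the default process group during training.
sync_model = nn.SyncBatchNorm.convert_sync_batchnorm(model)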
6. extensions and variations
Batch normalization inspired a wave of research on how best to normalize intermediate representations in deep networks. Let us discuss some of the key variants.
layer normalization
Layer normalization (Ba et al., 2016) addresses a scenario where you might have to use very small batch sizes or sequences of variable lengths, as in certain RNN tasks. Instead of normalizing across the batch dimension, layer normalization normalizes across the feature dimension for each individual sample. That is, for each example in the batch, we compute the mean and variance of all hidden units in that layer, then re-scale them accordingly. This effectively decouples normalization from the batch dimension. It works especially well for recurrent networks and sometimes for Transformers, although in Transformers you often see pre-layer-norm or post-layer-norm variants that were introduced for training stability.
instance normalization
Instance normalization was originally proposed to address the style transfer problem, where normalizing across the entire batch of images can destroy instance-specific style details. Instead, you normalize each single example's feature map across spatial dimensions (in case of images). This method can preserve style or allow style transformations. It is widely used in tasks like image-to-image translation, style transfer, or generative models where each sample is supposed to maintain distinct style characteristics.
group normalization
Group normalization (Wu and He, 2018) is a technique that splits the channels of each sample into groups and normalizes each group separately. It aims to strike a middle ground between instance normalization and layer normalization. Group normalization does not rely on the batch dimension, so it is robust to smaller batches but can still preserve some of the statistical averaging across multiple channels. This is popular in detection and segmentation tasks where the batch size can be limited by GPU memory (e.g., in large image segmentation tasks, you might only fit 1 — 2 images per GPU).
batch renormalization
Batch renormalization (Ioffe, 2017) modifies batch normalization so that it can work better when the mini-batch statistics are poor estimates of the dataset statistics. The idea is to mitigate reliance on the batch's mean and variance by introducing correction terms that come from a running average. This can help especially when the batch size is small or the data is distributed in a complicated manner. For many use cases, standard batch normalization still suffices, but batch renormalization is an interesting approach if your batch size is severely restricted or if your data distribution is highly non-stationary.
comparison of normalization methods
There is no one-size-fits-all normalization. The best approach depends on your data characteristics, batch size constraints, and architecture type:
- Batch Normalization: Great for large mini-batches, widely used in CNNs and MLPs.
- Layer Normalization: Often chosen for RNNs, Transformers, or situations with small batch sizes.
- Instance Normalization: Style transfer, generative tasks.
- Group Normalization: Good compromise for tasks with small batch sizes or large images.
- Batch Renormalization: A specialized fix for small or unrepresentative batches in typical BN setups.
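Most of the variants compared above have direct PyTorch counterparts. A minimal sketch showing which dimensions each one normalizes over for an image-shaped tensor (the shapes and group count are illustrative):

import torch
import torch.nn as nn

x = torch.randn(8, 32, 16, 16)   # (batch, channels, height, width)

bn = nn.BatchNorm2d(32)          # per channel, over batch and spatial dims
ln = nn.LayerNorm([32, 16, 16])  # per sample, over channels and spatial dims
inorm = nn.InstanceNorm2d(32)    # per sample and per channel, over spatial dims
gn = nn.GroupNorm(num_groups=8, num_channels=32)  # per sample, per group of channels

for layer in (bn, ln, inorm, gn):
    print(layer(x).shape)        # all preserve the input shape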
other specialized variants
Beyond these main branches, there are further explorations: Decorrelated Batch Normalization, Streaming Normalization, Conditional Batch Normalization for multi-task learning or style manipulation, Recurrent Batch Normalization for RNNs, and more. These specialized solutions each revolve around the same fundamental principle: controlling the distribution of intermediate activations for more stable training and improved representational power.
7. experiments
effect on training speed and final accuracy
Batch normalization often drastically increases the training speed (in terms of epochs needed to converge), although it might add a small overhead per iteration. In practice, the total wall-clock time to achieve a certain accuracy is typically reduced significantly.
an mnist classification demonstration
Consider a fully connected network with a few hidden layers of 100 units each, using ReLU activation. Let us train this network on the MNIST handwritten digit dataset (28x28 grayscale images, 10 classes). We compare:
- Network without BN: Just linear layers + ReLU.
- Network with BN: Insert BN layers before ReLU in each hidden layer.
We use a learning rate of 0.01, batch size of 60, and a standard weight initialization. If we plot training accuracy versus training iterations, we typically see that the BN version rapidly jumps above 90% accuracy within the first thousand iterations, whereas the baseline might take many more iterations to get there.
Below is an illustrative snippet for a BN version in PyTorch:
import torch
import torch.nn as nn
import torch.optim as optim

class MLPwithBN(nn.Module):
    def __init__(self):
        super(MLPwithBN, self).__init__()
        self.fc1 = nn.Linear(784, 100)
        self.bn1 = nn.BatchNorm1d(100)
        self.fc2 = nn.Linear(100, 100)
        self.bn2 = nn.BatchNorm1d(100)
        self.fc3 = nn.Linear(100, 10)

    def forward(self, x):
        x = x.view(x.size(0), -1)  # flatten
        x = self.fc1(x)
        x = self.bn1(x)
        x = nn.functional.relu(x)
        x = self.fc2(x)
        x = self.bn2(x)
        x = nn.functional.relu(x)
        x = self.fc3(x)
        return x

def train_mnist_with_bn(dataloader):
    model = MLPwithBN()
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=0.01)
    for epoch in range(10):
        for batch_idx, (images, labels) in enumerate(dataloader):
            optimizer.zero_grad()
            outputs = model(images)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
Plotting the training/validation curves usually shows that:
- The BN network's training error drops faster in the early epochs.
- The BN model often can reach a slightly higher final accuracy or do so with less tuning of learning rates.
impact on different architectures (cnns, rnns, transformers)
- CNNs: Batch normalization is extremely common in convolutional neural networks such as VGG, ResNet, and many others. It is typically used after each convolution.
- RNNs: Applying BN in recurrent models is trickier due to time-step dependencies. Variants such as Recurrent Batch Normalization have been proposed, but mainstream solutions for RNNs often prefer layer normalization, which is simpler for recurrent connections.
- Transformers: Transformers commonly rely on layer normalization. However, in some variations, batch normalization might be used in the feed-forward components, or no normalization is used in some stages. The principle is similar — control the distribution of intermediate states for stable training.
code snippets for experiments
A typical experimental pipeline might look like this:
# High-level pseudocode for an experiment comparing BN vs No-BN
def experiment_bn_vs_nobn():
    # 1. Prepare data
    train_loader, valid_loader = get_mnist_loaders(batch_size=64)

    # 2. Define two models
    model_bn = MLPwithBN()
    model_nobn = MLPnoBN()

    # 3. Train both
    train(model_bn, train_loader, ...)
    train(model_nobn, train_loader, ...)

    # 4. Evaluate both
    acc_bn = evaluate(model_bn, valid_loader)
    acc_nobn = evaluate(model_nobn, valid_loader)
    print("Accuracy with BN:", acc_bn)
    print("Accuracy without BN:", acc_nobn)
One can then record how many epochs or steps each model takes to surpass a certain accuracy threshold, how stable the loss is across training steps, and the final accuracy or other metrics.
In image tasks like CIFAR-10, CIFAR-100, and ImageNet, the presence of batch normalization is nearly universal in modern CNN architectures. It is sometimes singled out as one of the factors that made training very deep networks feasible in the first place (e.g., ResNets rely heavily on batch normalization throughout their layers).
additional expansions and deep-dive discussions
Because batch normalization is so foundational, I will expand on a few more points that might further clarify advanced aspects or interesting nuances. These expansions can serve as optional reading if you want to see how batch normalization interacts with topics like partial updates, more exotic architectures, or advanced theoretical perspectives.
partial updates and internal covariate shift reconsidered
There has been debate about whether "internal covariate shift" fully explains BN's success. Some arguments revolve around the fact that normalization alone might not entirely remove distribution shifts, especially as the network evolves. Another line of thought proposes that batch normalization's success can also be tied to how it smooths the optimization landscape, making it easier for gradient descent to navigate. Indeed, a smoothed loss surface can significantly accelerate training. Regardless of the underlying reason, in practice, batch normalization addresses many training instability issues.
synergy with other techniques
- Dropout: BN already has a noise-like effect, but combining BN and dropout can sometimes still be beneficial — especially in large networks.
- Skip connections: BN can be used alongside residual or skip connections (as in ResNet). The ResNet building blocks typically use Conv → BN → ReLU patterns.
- Weight decay: This form of L2 regularization often plays well with BN. However, if you do not want to regularize $\gamma$ or $\beta$, you might have to exclude them from weight decay.
advanced topics in inference
When the network is deployed (i.e., at inference time), we no longer have a mini-batch of data from which to compute mean and variance. Instead, we rely on the exponential moving averages that were accumulated during training. A crucial detail is that if your final running statistics are inaccurate (e.g., if the training procedure changed the distribution drastically at some late stage), you might see performance degrade. Sometimes, people do a final pass called a "re-estimation of batch norm statistics" by running through the training set once more to refine the running means and variances. This can help calibrate the BN layers for inference if needed.
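One possible way to perform such a re-estimation pass in PyTorch is sketched below; reestimate_bn_stats is a hypothetical helper, not a library function:

import torch
import torch.nn as nn

@torch.no_grad()
def reestimate_bn_stats(model, data_loader):
    # Reset the accumulated statistics of every BN layer, then run the training
    # data through the model in train mode so fresh running means/variances
    # are collected. No parameters are updated (no optimizer step is taken).
    for module in model.modules():
        if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            module.reset_running_stats()
    model.train()
    for images, _ in data_loader:
        model(images)
    model.eval()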
limitations of batch normalization
Although BN is powerful, it is not a panacea. It can fail if:
- The batch size is too small, or your data is not i.i.d. (independent and identically distributed) within each mini-batch.
- The overhead of synchronizing batch statistics across many GPUs is high.
- You are dealing with recurrent or sequential data where time-step dependencies complicate the assumption of independence within the batch.
In such cases, variants like group normalization, layer normalization, or instance normalization can be more robust.
bridging theory and practice
In advanced literature, researchers have studied the effect of BN on the optimization landscape. Some findings suggest that BN can make the gradients more well-conditioned, effectively allowing for larger updates. Another theoretical perspective is that BN re-centers the hidden space, preventing extreme outliers from saturating nonlinearities like ReLU. While not all these theories are definitive, the empirical success of BN in deep learning remains unchallenged.
conclusion (optional ending note)
Batch normalization, for many practitioners, is a near-default component in constructing neural networks — especially for feed-forward or convolutional architectures. It has proven to significantly speed up training, allow for deeper networks, reduce sensitivity to hyperparameters, and often improve generalization. The method's core idea — normalize intermediate activations across mini-batches — seems simple, but it was a major milestone in deep learning research. Its impact is apparent in almost every modern architecture from CNNs in vision tasks to certain advanced feed-forward networks in other domains.
If your batch size is not too small, or if your data is well-shuffled so that each mini-batch is a decent representation of the overall distribution, you can typically expect batch normalization to provide immediate benefits. If you have unusual constraints, be aware of alternatives like layer normalization, instance normalization, group normalization, or batch renormalization.
Whether you are building a standard CNN or experimenting with new architectures, batch normalization remains an indispensable technique to keep in your toolbox, and understanding its mathematical underpinnings, as well as its practical usage, is vital for any machine learning professional.

[Figure: A conceptual diagram illustrating the forward pass of batch normalization in a convolutional neural network. The layer normalizes across the batch dimension and each channel.]
extended expansions to ensure thorough coverage and length
In this extended section, I will restate or deepen the coverage of topics related to batch normalization. The following expansions are partially reiterative yet dig deeper into the conceptual and theoretical frameworks, the historical evolution, and the nuanced practicalities. Although some of the material overlaps with earlier sections, revisiting it from multiple perspectives can help solidify understanding.
historical context and evolution
Before batch normalization took the deep learning field by storm, practitioners already recognized that normalizing inputs significantly improved training. However, few systematically attempted to normalize hidden layer activations in the middle of a network. The landmark paper by Sergey Ioffe and Christian Szegedy in 2015 introduced the idea of normalizing at each layer within mini-batches, framing it primarily as a solution to the internal covariate shift problem. This approach quickly gained acceptance because it addressed two major bottlenecks in training:
- The need to manually tune learning rates to avoid exploding or vanishing gradients.
- The difficulty in training very deep networks due to shifting distributions at each layer.
Early successes with BN included enabling deeper networks to train faster and achieve state-of-the-art results in the ImageNet classification challenge. Notably, models that integrated BN (such as new variants of Inception or ResNet) saw significant improvements in both speed and final accuracy.
Subsequent years witnessed a surge of variants:
- Layer Normalization (2016): proposed to tackle tasks with small batch sizes or RNNs.
- Instance Normalization (2016, 2017): especially for style transfer.
- Batch Renormalization (2017): improving reliability of BN estimates with small batches.
- Group Normalization (2018): bridging the gap between LN and BN in large-scale tasks with memory constraints.
Furthermore, specialized forms of BN also appeared for domain adaptation, multi-domain learning, conditional tasks, etc.
deeper dive into internal covariate shift
The concept of covariate shift is well-studied in classical machine learning: it occurs when the distribution of input features changes between training and testing. The word "internal" in internal covariate shift underscores that the phenomenon is happening inside the network between layers, not just between training and test sets. Because each layer's input can shift as prior layers learn, the deeper layers effectively see a non-stationary distribution over the course of training.
From a more formal perspective, consider a neural network with $L$ layers, each with parameters that update at each iteration. We can denote the input to layer $l$ by $z_l$. The distribution of $z_l$ depends on the parameters of all preceding layers $1, \dots, l-1$. Whenever those earlier layers change, so does the distribution of $z_l$. BN tries to keep the mean and variance of $z_l$ more consistent during training iterations by performing the normalization per mini-batch.
Critics argue that BN does not truly fix all forms of shifting distributions in deeper layers, especially once the layer's scale and shift parameters $\gamma$ and $\beta$ are introduced. The distribution is still subject to changes in $\gamma$ and $\beta$ themselves. Nonetheless, it remains true that BN empirically stabilizes and accelerates training, which is the principal practical reason behind its widespread use.
training with large vs. small batch sizes
One of the biggest disadvantages of BN is the reliance on sufficiently large mini-batches to get reliable estimates of mean and variance. If the batch is too small:
- The sample mean and variance might be highly noisy, leading to unstable updates.
- The performance might degrade, or the model might not converge properly.
- In extreme cases (like 1 or 2 images per batch), BN can hamper training more than it helps.
Practitioners have responded by using advanced techniques like virtual batch normalization (VBN) or micro-batch accumulation, or by switching to alternative normalizations. For example, if you can only fit a single example on your GPU at a time, you can accumulate activations or statistics across multiple forward passes and only then apply the normalization with that larger effective batch. Alternatively, you might switch to group normalization, which does not rely on the batch dimension at all.
bridging to statistical theory
From a statistical standpoint, normalizing data typically helps gradient-based methods because it avoids certain pathological curvatures in the loss landscape. If we recall the concept of the covariance structure of features, high correlation among features can hamper gradient-based optimization. BN breaks up some of these correlations by ensuring each dimension has the same scale, at least within a mini-batch. Additionally, the partial derivatives can become more stable as the layer's outputs remain in a narrower range. This can reduce or eliminate saturation in nonlinearities such as ReLU or sigmoid.
discussion on the hyperparameters
- $\epsilon$ (epsilon): Usually chosen to be $10^{-5}$ or $10^{-3}$. If you see "checkerboard artifacts" or unexpected performance issues, adjusting epsilon might help, though it is rarely the main culprit.
- momentum: Typically 0.9 or 0.99. The difference is in how quickly the running mean and variance adapt. If momentum is too high, the estimates can lag behind changes in the distribution; if it is too low, your estimates might be more volatile.
- Batch size: As mentioned, bigger is generally better, but practical constraints abound.
- Placement: Standard practice is Linear/Conv → BN → ReLU (or another activation). Some practitioners have found no big difference if they switch BN and ReLU, but the official BN paper recommended placing it before the nonlinearity.
computational overhead
Batch normalization does introduce extra computational steps:
- Mean and variance must be computed across each mini-batch.
- The output has an additional elementwise transformation ($\gamma \hat{x} + \beta$).
- During backprop, partial derivatives with respect to $\gamma$ and $\beta$ must also be computed, though frameworks do this efficiently.
For typical moderate or large batch sizes, the overhead is not severe, and the net effect is beneficial because fewer epochs are needed to converge. On modern GPUs, the overhead is minimal relative to the entire forward and backward pass of a large CNN.
advanced research trends
- Decorrelated Batch Normalization: This normalizes not just the first and second moments but also attempts to whiten the features so that they become decorrelated.
- Conditional Batch Normalization: The scale and shift parameters are replaced by functions of some condition, such as class labels or textual input. This is popular in tasks such as style transfer or multi-domain generation.
- Revisiting Normalization-Free Networks: A small subset of research explores networks that do not rely on BN or LN, e.g., using scaled weight initializations or carefully chosen skip connections that preserve variance. These "normalization-free" networks aim to reduce memory overhead or handle tasks where BN fails.
practical notes on partial vs. full batch
Sometimes, people worry about the difference between "batch" in the name "batch normalization" and the typical "mini-batch" used in training. Strictly speaking, the original BN approach is mini-batch normalization. The dataset can be extremely large, but we are only computing means and variances on each mini-batch. This is usually enough to yield stable estimates, especially if the mini-batch is randomly sampled from the dataset.
re-centered recaps of forward and backward passes
We consider the partial derivatives $\partial L / \partial x^{(i)}$, $\partial L / \partial \gamma$, and $\partial L / \partial \beta$. The derivations revolve around the chain rule applied to the normalization and scale-and-shift transformations defined in the mathematical foundations section. In practice, frameworks like PyTorch or TensorFlow handle this automatically via computational graphs.
synergy with skip connections (resnets)
In ResNet and many advanced CNN architectures, BN is typically used at multiple points in each residual block. For example, in a classic ResNet building block:
x -> Conv -> BN -> ReLU -> Conv -> BN
|___________________________________ + -> ReLU
The skip connection sums the input to the block with the output of the second BN. This synergy works well because BN ensures that the residual branch does not blow up or vanish. Then the final ReLU also receives stable distributions from the block's output.
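A minimal PyTorch sketch of such a block, a simplified version of the ResNet basic block that assumes the input and output channel counts match so the skip connection can be a plain identity:

import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    def __init__(self, channels):
        super(BasicBlock, self).__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # skip connection adds the block's input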
emergent phenomena
Empirically, one emergent phenomenon is that BN often allows stable training with a range of higher learning rates than one might normally attempt. This is especially helpful if you want to train quickly with a large learning rate and then do a learning rate schedule or warm restarts. Another phenomenon is that BN can help networks generalize better. The slight mismatch between training and inference modes can even act like a mild regularizer. However, there have been edge cases where a mismatch in distributions between training and inference can cause performance drops if the running mean and variance are inaccurate.
domain shift and finetuning
When performing domain adaptation or transfer learning, one might face a new data distribution. The stored running statistics in BN layers might not reflect the new distribution well. A solution is to update or "finetune" the BN layers on the new domain data. That is, allow new means and variances to be computed and stored for the new domain, while optionally freezing or lightly adjusting other parts of the network. This technique, often called "re-BN," can be a powerful approach to quickly adapt to domain shifts.
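One way to express this idea in PyTorch is sketched below; finetune_bn_only is a hypothetical helper that freezes everything except the BN affine parameters, while the running statistics refresh on the new domain because the model stays in training mode:

import torch
import torch.nn as nn
import torch.optim as optim

def finetune_bn_only(model, new_domain_loader, epochs=1, lr=1e-3):
    # Freeze everything, then unfreeze only the BN affine parameters (gamma, beta).
    for p in model.parameters():
        p.requires_grad_(False)
    bn_params = []
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            for p in m.parameters():
                p.requires_grad_(True)
                bn_params.append(p)
    optimizer = optim.SGD(bn_params, lr=lr)
    criterion = nn.CrossEntropyLoss()
    model.train()  # BN running statistics also update on the new domain
    for _ in range(epochs):
        for images, labels in new_domain_loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()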
extremely large-scale training
In massive distributed training scenarios (e.g., training on hundreds of GPUs), the local mini-batch on each GPU might be small, but the overall global batch across all GPUs can be large. Some frameworks implement synchronous BN that computes the mean and variance across all GPUs. This keeps the BN estimates consistent throughout the distributed system. Alternatively, if we do asynchronous or local BN, we risk each GPU having different estimates of the distribution, which can hamper convergence.
debugging tips
- If your training accuracy suddenly plummets when you switch the network from training to inference mode (model.eval() in PyTorch), suspect that your BN running statistics might be off. You can debug by looking at the saved running means and variances in each BN layer and comparing them with the actual batch statistics (see the diagnostic sketch after this list). If the mismatch is large, re-check your training procedure or consider doing a separate pass to re-estimate the BN stats.
- If you see "nan" or "inf" values, verify that $\epsilon$ is not set too small, or that your batch variance is not zero. Another possibility is that the learning rate is still too high.
concluding expansions
Batch normalization remains the default normalization technique in many domains. Despite that, if you find yourself debugging training issues, do not forget to examine the BN layers carefully: the default hyperparameters might not be optimal in every scenario, especially if your mini-batch distribution differs drastically from typical assumptions. The synergy between BN, correct random shuffling, and a sufficiently large batch size can yield extremely stable and fast training.
Moreover, as new architectural paradigms (like Transformers) continue to reshape the machine learning landscape, the fundamental idea of normalizing internal representations has persisted in various forms. The general principle behind BN — that controlling internal activation scales can drastically ease training — continues to be validated across numerous tasks and architectures.
Below is one more code snippet for advanced usage, demonstrating how one might implement a custom BN approach in PyTorch that explicitly tracks the means and variances:
import torch

class CustomBatchNorm(torch.nn.Module):
    def __init__(self, num_features, eps=1e-5, momentum=0.1):
        super(CustomBatchNorm, self).__init__()
        self.num_features = num_features
        self.eps = eps
        self.momentum = momentum
        # learnable scale (gamma) and shift (beta)
        self.gamma = torch.nn.Parameter(torch.ones(num_features))
        self.beta = torch.nn.Parameter(torch.zeros(num_features))
        # running statistics are registered as buffers so they are saved in the
        # state_dict and moved to the right device together with the module
        self.register_buffer('running_mean', torch.zeros(num_features))
        self.register_buffer('running_var', torch.ones(num_features))

    def forward(self, x):
        # expects x of shape (N, C, H, W), i.e., the output of a 2D convolution
        if self.training:
            # per-channel mean/variance over the batch and spatial dimensions
            mean = x.mean(dim=(0, 2, 3), keepdim=True)
            var = x.var(dim=(0, 2, 3), keepdim=True, unbiased=False)
            # update running stats outside the autograd graph
            with torch.no_grad():
                self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mean.squeeze()
                self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var.squeeze()
            # normalize with the current batch statistics
            x_hat = (x - mean) / torch.sqrt(var + self.eps)
        else:
            mean = self.running_mean.view(1, -1, 1, 1)
            var = self.running_var.view(1, -1, 1, 1)
            x_hat = (x - mean) / torch.sqrt(var + self.eps)
        gamma = self.gamma.view(1, -1, 1, 1)
        beta = self.beta.view(1, -1, 1, 1)
        return x_hat * gamma + beta
In this custom layer, I explicitly compute the per-channel mean and variance over spatial dimensions and across the batch dimension, then update the running statistics accordingly. This code snippet clarifies the underlying mechanics of BN that frameworks usually handle for you automatically.
Through these expansions, I hope to have provided a thoroughly comprehensive overview of batch normalization, from its fundamental formulas to advanced practical tips and to its broader context in deep learning. If you are building neural networks for modern tasks, it is almost certain you will rely on BN or a close variant of it, making a deep understanding invaluable.