Gradient optimization
Everybody gangsta until the real data
⌛ ~50 min 🗿 Beginner · 02.02.2023 · #31

🎓 22/167

This post is part of the Mathematical optimization educational series from my free course. Please keep in mind that the correct sequence of posts is outlined on the course page, while the order in the Research feed can be arbitrary.

I'm also happy to announce that I've started working on standalone paid courses, so you can support my work and get affordable educational material. These courses will be of a completely different quality, with more theoretical depth and a niche focus, and will feature challenging projects, quizzes, exercises, video lectures, and supplementary materials. Stay tuned!


Gradient optimization lies at the heart of nearly all modern machine learning (ML) methods, powering everything from classical linear regression models to massive deep neural networks. When we train a model — be it a simple regression model or a state-of-the-art Transformer — our objective is to adjust the model's internal parameters in a way that minimizes some loss or cost function. This minimization is rarely performed analytically, as exact solutions often do not exist for complex models or might be computationally intractable. Instead, we rely on iterative optimization procedures that exploit the gradient of the loss with respect to the model parameters.

A major reason gradient optimization has become so ubiquitous is that it scales relatively well to high-dimensional parameter spaces and large datasets. This is particularly crucial in today's world of deep learning, where models may consist of hundreds of millions (or even billions) of parameters. As a result, expertise in gradient-based optimization is essential for designing, training, and understanding a broad range of machine learning algorithms.

In this article, we will dive into the fundamentals of gradient optimization, revisit the basics of loss minimization in ML, explore gradient descent (the most widely used gradient optimization algorithm), and then examine its three main variants: batch gradient descent, stochastic gradient descent, and mini-batch gradient descent. Along the way, we will study both theoretical underpinnings and practical considerations. Our focus here is on the "classical" versions of gradient-based optimization. In a subsequent article, we will discuss advanced modifications, including adaptive learning rate methods (e.g., Adam, RMSProp, Adagrad) and second-order approaches (e.g., Newton's method, quasi-Newton methods).

Whether you come from a background in machine learning, data science, or a related field, understanding the core mechanics and nuances of gradient optimization is foundational. By the end of this piece, you should have a deeper grasp of how gradient-based optimizers work, why they are used, and how to implement and tune them in practice.

2. Revisiting the basics of ML optimization

In machine learning, a model is typically defined by a set of parameters (weights, biases, coefficients, or any other internal variables) that we wish to learn from data. For a supervised learning problem, we often have a labeled dataset $\{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$. Each $x_i$ denotes the features for the $i$-th instance, and $y_i$ denotes the corresponding target (e.g., a class label or a continuous value).

The objective function (loss or cost function)

A core piece of machine learning is defining how we measure the performance of a model's parameters on the given data. This is typically done via a loss (or cost) function. For example, in linear regression with mean squared error (MSE) as the loss, our goal is to minimize:

$$ L(\mathbf{w}) = \frac{1}{m} \sum_{i=1}^{m} \bigl(y_i - \hat{y}_i(\mathbf{w})\bigr)^2, $$

where $\hat{y}_i(\mathbf{w})$ is the model's prediction for the $i$-th data point, and $\mathbf{w}$ represents the model parameters (weights). In classification tasks, one might use a cross-entropy loss, hinge loss, or another appropriate objective.

Parameter space and searching for minima

For many models, the parameter space can be extremely large (sometimes millions of dimensions). Directly solving for the global optimum can be very difficult or impossible in closed form. Instead, we rely on numerical optimization methods that iteratively refine an initial guess.

Significance of gradients in optimization

The gradient of the loss function with respect to the parameters tells us the local direction of steepest ascent in the loss landscape. To minimize the loss, we want to step in the opposite direction of that gradient. This insight underpins gradient descent: if the gradient of $L(\mathbf{w})$ at step $t$ is $\nabla L(\mathbf{w}^{(t)})$, we update our parameters as:

$$ \mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta \,\nabla L(\mathbf{w}^{(t)}), $$

where $\eta$ is the learning rate. While conceptually simple, properly tuning the gradient descent procedure can be an art in itself: learning rate choice, initialization strategy, and iteration scheduling all factor into the final performance of the model.
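To make the update rule concrete, here is a minimal sketch (my own toy example, not part of the original derivation) that runs a few iterations on the one-dimensional quadratic $L(w) = (w - 3)^2$, whose gradient is $2(w - 3)$:

<Code text={`
# Toy example: gradient descent on L(w) = (w - 3)^2, whose gradient is 2*(w - 3).
# The values eta = 0.1 and w = 0.0 are illustrative choices.
eta = 0.1  # learning rate
w = 0.0    # initial guess

for t in range(5):
    grad = 2 * (w - 3)    # analytic gradient at the current point
    w = w - eta * grad    # the gradient descent update
    print(f"step {t + 1}: w = {w:.4f}, L(w) = {(w - 3) ** 2:.4f}")

# w moves steadily toward the minimizer w* = 3:
# 0.6000, 1.0800, 1.4640, 1.7712, 2.0170
`}/>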

3. Gradient descent

Gradient descent (GD) is one of the most fundamental and well-known optimization algorithms in machine learning. Its basic premise is:

  1. Start with an initial guess for the parameters (e.g., random initialization).
  2. Compute the gradient of the objective function (the cost or loss) with respect to the parameters.
  3. Update the parameters by moving a small step against the gradient (the direction of steepest descent).
  4. Repeat steps 2 and 3 until convergence or until a stopping criterion is met.

Key terms to keep in mind:

  • Cost function (Loss function): The function you want to minimize (e.g., MSE, cross-entropy).
  • Gradient: The vector of partial derivatives of the cost function with respect to all parameters.
  • Convergence: A state where the parameters are no longer changing meaningfully, or a defined stopping criterion is met.
  • Learning rate: A hyperparameter ($\eta$) that controls the size of the step in the direction of the negative gradient.

Gradient descent's concept is straightforward, yet the practical details — especially regarding convergence speed and numerical stability — are crucial to making it work effectively. For example, setting the learning rate too high can cause the parameters to oscillate wildly or diverge, while a too-small learning rate can lead to painfully slow training.
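This failure mode is easy to reproduce. In the sketch below (the same toy quadratic as above; both learning rates are illustrative), each update multiplies the error $(w - 3)$ by $(1 - 2\eta)$, so any $\eta > 1$ makes every step overshoot further than the last:

<Code text={`
def run(eta, w=0.0, steps=5):
    # Same toy objective as before: L(w) = (w - 3)^2, gradient 2*(w - 3).
    # Each update multiplies the error (w - 3) by (1 - 2*eta).
    trajectory = []
    for _ in range(steps):
        w = w - eta * 2 * (w - 3)
        trajectory.append(round(w, 3))
    return trajectory

print(run(eta=0.1))  # converges: [0.6, 1.08, 1.464, 1.771, 2.017]
print(run(eta=1.1))  # diverges:  [6.6, -1.32, 8.184, -3.221, 10.465]
`}/>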

4. Math behind gradient descent

The mathematics of gradient descent can be understood by looking at how we compute partial derivatives of a chosen loss function and update the model parameters accordingly. In machine learning, these partial derivatives often correspond to how each weight in a neural network or a linear model influences the overall cost.

4.1 Deriving the gradient descent formula

Suppose we have a cost function $L(\mathbf{w})$. The gradient $\nabla L(\mathbf{w})$ is the vector of partial derivatives:

$$ \nabla L(\mathbf{w}) = \begin{bmatrix} \frac{\partial L(\mathbf{w})}{\partial w_1} \\ \frac{\partial L(\mathbf{w})}{\partial w_2} \\ \vdots \\ \frac{\partial L(\mathbf{w})}{\partial w_n} \end{bmatrix}. $$

Intuitively, each component of $\nabla L(\mathbf{w})$ measures how sensitive the loss is to changes in a particular parameter $w_j$. Moving against the gradient, i.e., subtracting a small multiple of the gradient from the current parameter vector, lowers the value of the cost function, at least in a local sense.

Mathematically, the update rule for gradient descent can be written as:

$$ \mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta \,\nabla L(\mathbf{w}^{(t)}), $$

where $\mathbf{w}^{(t)}$ is the parameter vector at iteration $t$, and $\eta$ is the learning rate.

4.2 Calculating partial derivatives and their importance

For many ML models — like linear or logistic regression — the partial derivatives of the loss function can be computed analytically. For instance, in linear regression with mean squared error, one can derive a closed-form gradient expression:

$$ \nabla L(\mathbf{w}) = \frac{2}{m} \mathbf{X}^T (\mathbf{X} \mathbf{w} - \mathbf{y}), $$

where $\mathbf{X}$ is the design matrix of input features, $\mathbf{y}$ is the vector of target values, and $m$ is the number of training samples.

In other models, especially neural networks, we often use backpropagation (the chain rule) to compute partial derivatives. Regardless of the model, calculating partial derivatives accurately and efficiently is essential to applying gradient descent.
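A standard way to verify that partial derivatives are computed correctly is a finite-difference gradient check. The sketch below (my own illustration; the random data are arbitrary) compares the closed-form MSE gradient from above against central differences:

<Code text={`
import numpy as np

def mse_loss(w, X, y):
    return np.mean((X.dot(w) - y) ** 2)

def analytic_gradient(w, X, y):
    # Closed-form MSE gradient: (2/m) * X^T (Xw - y)
    m = X.shape[0]
    return (2.0 / m) * X.T.dot(X.dot(w) - y)

def numeric_gradient(w, X, y, eps=1e-6):
    # Central differences: dL/dw_j ~ (L(w + eps*e_j) - L(w - eps*e_j)) / (2*eps)
    grad = np.zeros_like(w)
    for j in range(len(w)):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[j] += eps
        w_minus[j] -= eps
        grad[j] = (mse_loss(w_plus, X, y) - mse_loss(w_minus, X, y)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X, y, w = rng.normal(size=(20, 3)), rng.normal(size=20), rng.normal(size=3)
print(np.allclose(analytic_gradient(w, X, y), numeric_gradient(w, X, y), atol=1e-5))  # True
`}/>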

4.3 Convergence criteria

Determining when gradient descent "converges" can be somewhat subjective or dependent on application-specific needs. Common criteria include:

  • Parameter change: Stop when the difference $\|\mathbf{w}^{(t+1)} - \mathbf{w}^{(t)}\|$ is below a certain threshold.
  • Gradient magnitude: Stop when $\|\nabla L(\mathbf{w}^{(t)})\|$ becomes very small.
  • Maximum iterations: Stop after a fixed number of iterations or epochs.
  • Validation metric: In practice, we often monitor a validation set's performance. If it stops improving (or worsens), we may reduce the learning rate or halt entirely (early stopping).

In many real-world ML applications, early stopping based on validation performance is especially common, since it helps prevent overfitting.
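As a small illustration of these criteria (a sketch, not a canonical recipe; grad_fn, tol, and the toy objective are invented for the example), the gradient-magnitude and parameter-change checks can be combined in one loop:

<Code text={`
import numpy as np

def gd_with_stopping(grad_fn, w0, eta=0.1, tol=1e-6, max_iter=10000):
    """Sketch: gradient descent that stops on a small gradient norm,
    a small parameter change, or an iteration cap (whichever comes first)."""
    w = np.asarray(w0, dtype=float)
    for t in range(max_iter):
        g = grad_fn(w)
        if np.linalg.norm(g) < tol:          # gradient-magnitude criterion
            return w, t
        w_new = w - eta * g
        if np.linalg.norm(w_new - w) < tol:  # parameter-change criterion
            return w_new, t
        w = w_new
    return w, max_iter                       # maximum-iterations criterion

# Toy objective: L(w) = ||w - c||^2, with gradient 2*(w - c)
c = np.array([1.0, -2.0])
w_star, n_iter = gd_with_stopping(lambda w: 2 * (w - c), w0=[0.0, 0.0])
print(w_star, n_iter)  # ends near [1, -2] after a few dozen iterations
`}/>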

4.4 Learning rate and its impact

The learning rate $\eta$ is arguably the single most important hyperparameter in gradient descent. Its role is to determine how big a step we take in the negative gradient direction each time we update the parameters.

  • Too large $\eta$: The parameter updates might overshoot the minimum. Loss values can explode or fluctuate drastically, resulting in divergence or chaotic behavior.
  • Too small $\eta$: Convergence becomes very slow, potentially requiring an impractically large number of updates.
  • Variable learning rate: Many training regimens feature learning rate "decay" or scheduling, in which $\eta$ is reduced over time to refine the convergence process. A common schedule might be a simple decay, $\eta^{(t)} = \frac{\eta_0}{1 + kt}$, or an exponential decay, $\eta^{(t)} = \eta_0 \cdot \alpha^t$, where $k$ or $\alpha$ is some constant chosen by the researcher.
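Both schedules take only a few lines of code; below is a minimal sketch, with $\eta_0$, $k$, and $\alpha$ set to illustrative values:

<Code text={`
eta0 = 0.1  # initial learning rate (illustrative value)

def inverse_decay(t, k=0.01):
    # Simple decay: eta(t) = eta0 / (1 + k*t)
    return eta0 / (1 + k * t)

def exponential_decay(t, alpha=0.99):
    # Exponential decay: eta(t) = eta0 * alpha^t
    return eta0 * alpha ** t

for t in [0, 100, 1000]:
    print(f"t={t}: inverse {inverse_decay(t):.5f}, exponential {exponential_decay(t):.6f}")
# t=0:    both start at 0.1
# t=100:  inverse ~0.05,    exponential ~0.0366
# t=1000: inverse ~0.00909, exponential ~0.000004
`}/>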

In practice, the learning rate can be tuned by trial and error, grid search, random search, or more sophisticated automated hyperparameter optimization. The perfect setting is highly model- and dataset-dependent, making this a critical point of experimentation in real-world ML pipelines.

5. Main types of gradient descent

While the classical idea of gradient descent updates is straightforward, the exact mechanism by which we compute the gradient (and the portion of the dataset used) can vary significantly. This gives rise to three main variants:

  1. Batch gradient descent (BGD): Uses the entire training set to compute the gradient at each step.
  2. Stochastic gradient descent (SGD): Uses one training example (or sometimes a very small subset) per gradient update.
  3. Mini-batch gradient descent (MBGD): Uses a small batch of training examples (e.g., 32, 64, or 256 samples) at each iteration.

Each approach has advantages and drawbacks, influencing how it handles large datasets, converges to minima, and generalizes to new data.

5.1 Batch gradient descent

Batch gradient descent computes the gradient of the cost function by summing or averaging over all training examples before performing a single update. That is:

$$ \nabla L(\mathbf{w}) = \frac{1}{m} \sum_{i=1}^{m} \nabla L_i(\mathbf{w}), $$

where $L_i(\mathbf{w})$ is the loss contribution from the $i$-th example, and $m$ is the total number of training examples.

Pros:

  • The gradient computation is exact for the current set of parameters, since the entire dataset is used (no sampling).
  • Often yields stable and predictable updates.

Cons:

  • Very slow for large datasets, as each update requires a pass over all the training data.
  • Consumes significant memory if the dataset does not fit comfortably in RAM or GPU memory.

Use cases:

  • Datasets that are moderately sized (where computing the gradient over the entire set is not prohibitively expensive).
  • Problems where stable, more deterministic updates are preferable.

5.2 Stochastic gradient descent

Stochastic gradient descent (SGD) picks one training example (or sometimes a single random subset) at each iteration to compute an approximate gradient. Formally, if $i$ is chosen randomly from $\{1, 2, \ldots, m\}$, then:

$$ \nabla L(\mathbf{w}) \approx \nabla L_i(\mathbf{w}). $$

Pros:

  • Can be extremely fast and memory efficient, as it processes one example at a time (in the purest form).
  • Potentially escapes local minima more easily, because the gradient is "noisy."
  • Scales well to massive datasets (common in online learning scenarios).

Cons:

  • High variance in updates can result in an erratic convergence path.
  • Requires more careful tuning of the learning rate, often with a decay schedule.
  • The objective function does not necessarily decrease at every iteration (due to the sampling noise).

Use cases:

  • Very large datasets or streaming data.
  • Online or real-time machine learning applications where data arrives continuously.
  • Situations where memory resources are limited.

5.3 Mini-batch gradient descent

Mini-batch gradient descent is in many ways a middle ground between batch and stochastic methods. It processes a small batch of training data (say $b$ examples) at each iteration:

$$ \nabla L(\mathbf{w}) \approx \frac{1}{b} \sum_{i \in \text{batch}} \nabla L_i(\mathbf{w}), $$

where "batch" is a small random subset of the training set. For instance, if m=60,000m = 60,000 and the chosen mini-batch size is 64, each update uses 64 examples out of the 60,000.

Pros:

  • Less computationally heavy than full batch descent, since it only processes a small subset at a time.
  • Reduces variance compared to pure SGD, since multiple examples smooth out the gradient estimate.
  • Efficiently vectorizable on modern hardware (GPUs often favor certain batch sizes).

Cons:

  • Slightly more memory-intensive than pure SGD if the mini-batch is large, but typically not as large as full batch.
  • Still introduces some noise in the gradient, although less than SGD.

Use cases:

  • The most common approach in modern deep learning, as it balances computational efficiency with convergence stability.
  • Works well with GPU-based acceleration.
  • Generally recommended for medium to large datasets in practical ML scenarios.

Most deep learning frameworks default to mini-batch training. Researchers often tune the mini-batch size based on hardware constraints (e.g., GPU VRAM) and find that certain batch sizes can lead to better (or faster) convergence, depending on the problem and architecture.

6. Practical implementation considerations

The success of gradient-based methods depends on several practical details that go beyond the core update rule. Here are a few key considerations:

  1. Data preprocessing:

    • Normalization or standardization: Bringing all features to a similar scale speeds up convergence by avoiding extremely elongated or skewed loss landscapes.
    • Shuffling: It is typically beneficial to shuffle the training data (or shuffle it in mini-batches) to avoid unwanted ordering effects.
  2. Monitoring convergence and validation performance:

    • Track the training loss and validation loss over iterations or epochs.
    • Consider using early stopping when the validation performance stops improving.
  3. Hyperparameter tuning:

    • Learning rate: Often the single most important parameter; must be chosen carefully.
    • Batch size: For mini-batch methods, the size of each batch can significantly affect both speed and performance.
    • Regularization: Methods like $L_2$ (ridge) or $L_1$ (lasso) can be integrated with gradient descent simply by adjusting the loss function to include regularization terms (see the sketch after this list).
  4. Exploding/vanishing gradients:

    • In deep neural networks, the gradient can sometimes become extremely large or extremely small, causing numerical issues and hampering learning. Various techniques (e.g., gradient clipping, careful initialization, batch normalization, or sophisticated architectures) are used to mitigate these issues.
  5. Efficient computation:

    • Tools like vectorization (in NumPy, PyTorch, TensorFlow) can dramatically speed up gradient computations.
    • GPUs or TPUs excel at mini-batch gradient computations for deep learning tasks.
  6. Heuristic improvements:

    • Using momentum-based methods (to be discussed in detail in subsequent articles) can smooth out noisy updates.
    • Advanced learning rate schedules (e.g., warm restarts, cyclical learning rates) can sometimes yield better results than static schedules.
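As an example for the regularization point above, here is a hedged sketch (under the linear-regression MSE setup from section 4.2; lam is an illustrative value) showing that an $L_2$ penalty $\lambda \|\mathbf{w}\|^2$ only adds a $2\lambda \mathbf{w}$ term to the gradient:

<Code text={`
import numpy as np

def ridge_gradient(w, X, y, lam=0.1):
    """Sketch: MSE gradient plus the gradient of an L2 penalty lam * ||w||^2.
    lam = 0.1 is an illustrative value, not a recommendation."""
    m = X.shape[0]
    mse_grad = (2.0 / m) * X.T.dot(X.dot(w) - y)  # data-fit term
    l2_grad = 2.0 * lam * w                       # regularization term
    return mse_grad + l2_grad

# Any of the descent loops in section 7 can use this in place of the plain MSE gradient.
`}/>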

7. Implementations

Below are three sample implementations that demonstrate how gradient descent might be coded from scratch in Python, focusing on a simple linear regression scenario. Following that, we'll briefly illustrate how one might use scikit-learn's SGDClassifier for classification tasks.


7.1 Example: batch gradient descent for linear regression (NumPy)

Here, we demonstrate how to implement batch gradient descent to learn the parameters of a simple linear regression model. Suppose we have a design matrix $\mathbf{X}$ of dimension $m \times n$ and a target vector $\mathbf{y}$ of dimension $m$.

<Code text={`
import numpy as np

def batch_gradient_descent(X, y, learning_rate=0.01, n_iterations=1000):
    """
    Perform batch gradient descent for a linear regression model.
    X is an m x n matrix of features.
    y is an m-dimensional vector of targets.
    """
    m, n = X.shape
    # Initialize parameters (weights) randomly or with zeros
    w = np.zeros(n)
    
    # Optionally, you can add a column of 1s to X externally for the intercept
    # or manage it separately. We'll assume X is already preprocessed.
    
    for iteration in range(n_iterations):
        # Predictions
        y_pred = X.dot(w)
        
        # Compute the gradient of the MSE cost function
        # L(w) = (1/m) * sum((y_pred - y)^2)
        # Gradient: (2/m) * X.T.dot(y_pred - y)
        
        gradient = (2.0 / m) * X.T.dot(y_pred - y)
        
        # Update rule
        w = w - learning_rate * gradient
        
        # (Optional) Monitor the loss
        if iteration % 100 == 0:
            loss = np.mean((y_pred - y) ** 2)
            print(f"Iteration {iteration}, Loss: {loss:.5f}")
    
    return w
`}/>
  • Initialization: We set the initial parameter vector $\mathbf{w}$ to zeros (or random small values).
  • Learning rate: Default is $0.01$.
  • Gradient computation: Uses the entire dataset on each update (batch GD).
  • Update step: $\mathbf{w} \leftarrow \mathbf{w} - \eta \,\nabla L(\mathbf{w})$.
  • Output: Returns the final learned weights.

In practice, you might add advanced features like dynamic learning rate schedules or early stopping. You would also preprocess the input data (e.g., normalization, mean-centering) before calling this function.
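As a quick usage illustration (my own sketch; the synthetic data, learning rate, and iteration count are invented for the example):

<Code text={`
import numpy as np

# Synthetic linear data: y = 4 + 3*x + Gaussian noise
rng = np.random.default_rng(42)
m = 200
x = rng.uniform(0, 2, size=m)
y = 4 + 3 * x + rng.normal(scale=0.5, size=m)

# Prepend a column of 1s so the first weight acts as the intercept
X = np.column_stack([np.ones(m), x])

w = batch_gradient_descent(X, y, learning_rate=0.1, n_iterations=1000)
print(w)  # should land near [4, 3]
`}/>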

7.2 Example: stochastic gradient descent for linear regression (NumPy)

Below is a simplistic code snippet for stochastic gradient descent. Each iteration uses exactly one randomly chosen data point to update the weights. This can converge quickly in practice but is often noisy.

<Code text={`
import numpy as np

def stochastic_gradient_descent(X, y, learning_rate=0.01, n_epochs=5):
    """
    Perform stochastic gradient descent for a linear regression model.
    X is an m x n matrix of features.
    y is an m-dimensional vector of targets.
    """
    m, n = X.shape
    w = np.zeros(n)
    
    for epoch in range(n_epochs):
        # Shuffle the data to avoid cycles
        indices = np.random.permutation(m)
        X_shuffled = X[indices]
        y_shuffled = y[indices]
        
        for i in range(m):
            # Pick one example
            xi = X_shuffled[i, :].reshape(1, -1)
            yi = y_shuffled[i]
            
            # Predict
            y_pred = xi.dot(w)
            
            # Compute the gradient for this single example
            gradient = 2.0 * xi.T.dot(y_pred - yi)
            
            # Update
            w = w - learning_rate * gradient.flatten()
        
        # (Optional) Monitoring the overall loss at the end of each epoch
        total_loss = np.mean((X.dot(w) - y) ** 2)
        print(f"Epoch {epoch+1}, Loss: {total_loss:.5f}")
    
    return w
`}/>
  • Epoch: A single pass through all $m$ training samples.
  • Shuffle: We shuffle $\mathbf{X}$ and $\mathbf{y}$ each epoch for better convergence properties.
  • One-sample update: The gradient is computed using just one sample $i$.

Here, the parameters get updated $m$ times per epoch, once per training sample. This often allows faster initial progress but can be quite noisy, necessitating learning rate schedules or other smoothing techniques.

7.3 Example: mini-batch gradient descent for linear regression (NumPy)

A mini-batch approach is usually more efficient on modern hardware (particularly GPUs), especially for deep learning tasks. Conceptually, it is halfway between the above two approaches, so the implementation is similar; the main difference is that we pick (say) 32 or 64 samples at a time rather than $1$ or $m$.

<Code text={`
import numpy as np

def mini_batch_gradient_descent(X, y, learning_rate=0.01, n_epochs=5, batch_size=32):
    """
    Perform mini-batch gradient descent for a linear regression model.
    X is an m x n matrix of features.
    y is an m-dimensional vector of targets.
    """
    m, n = X.shape
    w = np.zeros(n)
    n_batches_per_epoch = m // batch_size
    
    for epoch in range(n_epochs):
        # Shuffle the data
        indices = np.random.permutation(m)
        X_shuffled = X[indices]
        y_shuffled = y[indices]
        
        for b in range(n_batches_per_epoch):
            start = b * batch_size
            end = start + batch_size
            
            X_batch = X_shuffled[start:end, :]
            y_batch = y_shuffled[start:end]
            
            # Predictions
            y_pred = X_batch.dot(w)
            
            # Compute gradient on the mini-batch
            gradient = (2.0 / batch_size) * X_batch.T.dot(y_pred - y_batch)
            
            # Update
            w = w - learning_rate * gradient
        
        # At the end of each epoch, you could measure the global training loss
        total_loss = np.mean((X.dot(w) - y) ** 2)
        print(f"Epoch {epoch+1}, Loss: {total_loss:.5f}")
    
    return w
`}/>

This approach typically converges more smoothly than pure stochastic gradient descent and more quickly than batch gradient descent (especially for large mm).
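To see the three variants side by side, here is a comparison sketch on synthetic data, reusing the three functions defined above (all constants are illustrative):

<Code text={`
import numpy as np

# Synthetic regression problem shared by all three variants
rng = np.random.default_rng(0)
m = 1000
X = np.column_stack([np.ones(m), rng.uniform(0, 2, size=m)])
y = X.dot(np.array([4.0, 3.0])) + rng.normal(scale=0.5, size=m)

w_batch = batch_gradient_descent(X, y, learning_rate=0.1, n_iterations=500)
w_sgd = stochastic_gradient_descent(X, y, learning_rate=0.01, n_epochs=5)
w_mini = mini_batch_gradient_descent(X, y, learning_rate=0.1, n_epochs=5, batch_size=32)

# All three should end up near w = [4, 3], with MSE near the noise floor (~0.25)
for name, w in [("batch", w_batch), ("stochastic", w_sgd), ("mini-batch", w_mini)]:
    print(f"{name:>10}: w = {np.round(w, 3)}, MSE = {np.mean((X.dot(w) - y) ** 2):.4f}")
`}/>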

7.4 Example: using scikit-learn's SGDClassifier

Many frameworks provide ready-made SGD-based estimators. Below is a short example using scikit-learn's SGDClassifier on the classic Iris dataset:

<Code text={`
from sklearn.linear_model import SGDClassifier
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Load the iris dataset
iris = datasets.load_iris()
X = iris.data  # shape (150, 4)
y = iris.target  # shape (150,)

# Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Create and fit an SGD classifier
clf = SGDClassifier(
    loss='hinge',      # The 'hinge' loss gives a linear SVM
    penalty='l2',      # L2 regularization
    alpha=0.0001,      # Regularization parameter
    learning_rate='optimal',
    max_iter=1000,
    shuffle=True,
    random_state=42
)

clf.fit(X_train, y_train)

# Evaluate
accuracy = clf.score(X_test, y_test)
print(f"Test set accuracy with SGDClassifier: {accuracy * 100:.2f}%")
`}/>
  • loss: By default, 'hinge' is used, which corresponds to a linear SVM approach.
  • penalty: Regularization method ('l2', 'l1', or 'elasticnet').
  • alpha: The coefficient of regularization.
  • learning_rate: 'optimal' is an adaptive method that scikit-learn uses to help with convergence.
  • shuffle: We usually shuffle data each epoch to improve performance.
  • random_state: Ensures reproducible results.

Because SGDClassifier is integrated into scikit-learn, it can handle the details of iteration, convergence detection, and even partial fitting on streaming data if necessary.
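For the streaming case, scikit-learn's partial_fit can be called chunk by chunk. Below is a sketch on the same train/test split as above; the chunk size of 15 is an illustrative choice:

<Code text={`
import numpy as np

# Sketch: incremental training on the same train/test split as above.
clf_stream = SGDClassifier(loss='hinge', random_state=42)

classes = np.unique(y_train)  # all classes must be declared on the first call
for start in range(0, len(X_train), 15):
    clf_stream.partial_fit(
        X_train[start:start + 15],
        y_train[start:start + 15],
        classes=classes
    )

print(f"Streaming accuracy: {clf_stream.score(X_test, y_test) * 100:.2f}%")
`}/>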

(Optional) Additional illustrations

Below are a few conceptual image placeholders that can be helpful for visualizing gradient descent, especially for those who are more visually inclined:

[Image placeholder] Alt: "Illustration of gradient descent on a contour plot". Caption: "A contour plot of a cost function in 2D parameter space, showing iterative steps moving downhill toward the minimum."

[Image placeholder] Alt: "Effect of different learning rates". Caption: "Depiction of how a small vs. large learning rate can either converge slowly or overshoot the minimum, respectively."

[Image placeholder] Alt: "Batch vs. Stochastic vs. Mini-batch updates". Caption: "Comparison of the three approaches: batch (large stable updates), stochastic (small noisy updates), and mini-batch (balanced approach)."

Putting it all together

Gradient descent in its many forms is an indispensable tool in the modern machine learning toolbox. While simple on the surface, mastering the intricacies of gradient-based optimization can elevate the performance and stability of your models in real-world contexts. Key points to remember:

  • Choice of variant: Decide which gradient descent approach best suits your dataset size, memory constraints, and hardware. Mini-batch is the de facto standard in deep learning, whereas batch or even pure stochastic gradient descent might be more fitting in smaller-scale or streaming applications.
  • Learning rate: Tune it carefully. Experiment with different schedules and watch for signs of divergence or overly slow convergence.
  • Implementation details: Proper data preprocessing, random shuffling, regularization, and gradient checks can drastically improve training outcomes.
  • Further enhancements: Momentum, adaptive gradient methods (e.g., Adam, RMSProp), and second-order approaches can lead to faster or more robust convergence. We will discuss these methods in a subsequent article.

Modern ML practitioners often take these fundamentals for granted, but understanding precisely how gradients are computed and used for parameter updates gives you deeper insight into why certain methods and heuristics work as they do. It also opens the door to creative experimentation and innovation, whether you're tackling a standard classification task or pushing the boundaries of deep learning research.

Gradient optimization is a cornerstone — once you grasp it, you're better positioned to tackle everything from logistic regression to large-scale deep networks, from simple academic examples to real-time streaming data scenarios. Moreover, the same principles carry over when you move into advanced concepts: adaptive optimizers, large-batch training for supercomputing clusters, distributed gradient computations, and more.

Finally, keep in mind that while gradient descent is ubiquitous, it isn't always a panacea. Certain classes of problems may be amenable to specialized solvers or alternative optimization strategies. Nonetheless, for the vast majority of machine learning tasks, gradient-based optimization (in one form or another) is the proven workhorse driving model training from start to finish.
