

🎓 61/167
This post is a part of the Probabilistic models & Bayesian methods educational series from my free course. Please keep in mind that the correct sequence of posts is outlined on the course page, while the order in Research can be arbitrary.
I'm also happy to announce that I've started working on standalone paid courses, so you can support my work and get affordable educational material. These courses will be of a completely different quality, with more theoretical depth and a niche focus, and will feature challenging projects, quizzes, exercises, video lectures and supplementary material. Stay tuned!
When diving into probabilistic models — especially those that are undirected or energy-based — there is a central concept we cannot avoid: the partition function. I find it crucial to shine a bright spotlight on this because the partition function is at once ubiquitous and famously troublesome. It appears in virtually all undirected models, from Restricted Boltzmann Machines (RBMs) to Markov Random Fields (MRFs) and beyond, serving as the normalizing constant that transforms raw energy functions (or unnormalized log probabilities) into well-defined probability distributions. The partition function, often denoted as $Z$, is often the direct cause of computational nightmares: it involves summing (or integrating) over a potentially astronomical number of configurations across a high-dimensional space. Because of its fundamental importance — as well as its notorious intractability — an entire ecosystem of approximation strategies, alternative learning objectives, and algorithmic cleverness has emerged around it.
The reason I dedicate a separate article to the partition function is the sheer depth of theoretical and practical methods that revolve around either avoiding it, approximating it, or circumventing its direct computation. In many advanced machine learning contexts, especially in the realm of energy-based models, the presence of the partition function in the log-likelihood invites an elaborate interplay of gradient terms, sampling techniques, and approximate inference strategies. If we don't fully understand the role of $Z$ — and the difficulty it poses — our grasp of these learning algorithms will remain incomplete or superficial at best.
1. Role of the partition function in undirected probabilistic models
To understand what $Z$ is, let's consider a general undirected model that assigns an energy $E(x; \theta)$ (where $x$ is the data and $\theta$ denotes the parameters) to each possible configuration of $x$. The unnormalized probability of a particular configuration can be expressed as:

$$\tilde{p}(x; \theta) = \exp\big(-E(x; \theta)\big)$$

The partition function is then defined as

$$Z(\theta) = \sum_{x} \exp\big(-E(x; \theta)\big)$$

if $x$ is discrete, or

$$Z(\theta) = \int \exp\big(-E(x; \theta)\big)\, dx$$

if $x$ is continuous. The fully normalized probability distribution is then:

$$p(x; \theta) = \frac{\tilde{p}(x; \theta)}{Z(\theta)} = \frac{\exp\big(-E(x; \theta)\big)}{Z(\theta)}$$

Hence, $Z(\theta)$ is the global normalizer. If $Z(\theta)$ is very difficult to compute or approximate, it complicates any method that relies on evaluating $p(x; \theta)$ or its gradient with respect to $\theta$.
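To get a feel for why computing $Z$ head-on is hopeless, here is a minimal sketch (purely illustrative, assuming a made-up fully connected binary model with energy $E(x) = -x^\top W x - b^\top x$) that computes $\log Z$ by brute-force enumeration. The sum runs over all $2^d$ configurations, which is only feasible for a handful of variables, and that is exactly the point.

import itertools
import numpy as np

# Hypothetical toy model: d binary variables with quadratic energy
# E(x) = -x^T W x - b^T x. Z sums over all 2^d configurations.
def log_partition_brute_force(W, b):
    d = W.shape[0]
    log_terms = []
    for bits in itertools.product([0, 1], repeat=d):
        x = np.array(bits, dtype=float)
        energy = -x @ W @ x - b @ x
        log_terms.append(-energy)          # log of exp(-E(x))
    log_terms = np.array(log_terms)
    m = log_terms.max()                    # log-sum-exp for numerical stability
    return m + np.log(np.exp(log_terms - m).sum())

rng = np.random.default_rng(0)
d = 12                                     # 2^12 = 4096 terms: still feasible
W = 0.1 * rng.standard_normal((d, d))
W = (W + W.T) / 2
b = 0.1 * rng.standard_normal(d)
print("log Z:", log_partition_brute_force(W, b))

At $d = 12$ this enumerates 4,096 terms; for even a small image patch with a few hundred binary variables the sum would have on the order of $2^{300}$ terms, which is why every method in the rest of this article exists.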
Challenges in high-dimensional spaces
In high-dimensional spaces — typical for image data, large-scale textual data, or other complex domains — computing $Z$ exactly is usually out of the question. The number of configurations can be astronomically large. Even approximate summation or integration can be burdensome without careful sampling or specialized methods. This intractability is what led to the development of widely used approximation strategies like Contrastive Divergence and Noise-Contrastive Estimation, as well as specialized sampling methods such as Annealed Importance Sampling.
Importance of partition-function-free methods
An important thread in the literature is the desire to circumvent $Z$ altogether. One might consider alternative objectives or surrogate losses that do not explicitly include $Z$. This has led to methods such as Pseudolikelihood and Score Matching, each with its own nuances and domain of applicability. These methods are often deeply connected with fundamental statistical theory. For instance, score matching has roots in older ideas from Fisher divergence and leads to interesting ways to estimate parameters by matching model gradients to data gradients. Meanwhile, pseudolikelihood builds upon the factorization properties of conditional distributions and is sometimes used in fields like social network analysis to handle large, complex networks of variables.
By the end of this article, I hope you will see not only the specific reasons why the partition function demands so much care, but also the variety of ways researchers have tackled it. The theoretical insights, the sampling-based methods, and the approximate objectives form a multifaceted toolbox. When confronted with the dreaded $Z$ in your own projects, you can decide whether to approximate it directly, bypass it with an alternative learning criterion, or try something more exotic altogether.
2. The log-likelihood gradient
When we train an undirected model such as an RBM or a Markov Random Field using maximum likelihood, we typically consider the log-likelihood function:

$$\mathcal{L}(\theta) = \sum_{i=1}^{N} \log p(x^{(i)}; \theta)$$

where $x^{(i)}$ is the $i$-th data point and $N$ is the size of the dataset. Expanding $\log p(x^{(i)}; \theta)$, we get:

$$\log p(x^{(i)}; \theta) = -E(x^{(i)}; \theta) - \log Z(\theta)$$

Thus, the log-likelihood for the entire dataset is:

$$\mathcal{L}(\theta) = -\sum_{i=1}^{N} E(x^{(i)}; \theta) - N \log Z(\theta)$$

assuming a single $Z(\theta)$ for all data points, though in some contexts we might have a sum of energies if the model factorizes in certain ways.
Positive and negative phases
The gradient of $\mathcal{L}(\theta)$ typically separates into what many in the Boltzmann machine literature call the "positive phase" and "negative phase". Specifically,

$$\frac{\partial \mathcal{L}(\theta)}{\partial \theta} = -\sum_{i=1}^{N} \frac{\partial E(x^{(i)}; \theta)}{\partial \theta} - N \frac{\partial \log Z(\theta)}{\partial \theta}$$

After distributing the negative sign and writing it per data point, you might see it more commonly expressed as:

$$\frac{\partial \log p(x; \theta)}{\partial \theta} = -\frac{\partial E(x; \theta)}{\partial \theta} + \mathbb{E}_{\tilde{x} \sim p(\tilde{x}; \theta)}\left[\frac{\partial E(\tilde{x}; \theta)}{\partial \theta}\right]$$

Here the term involving $\log Z(\theta)$ has been rewritten as an expectation under the model distribution:

$$\frac{\partial \log Z(\theta)}{\partial \theta} = -\mathbb{E}_{x \sim p(x; \theta)}\left[\frac{\partial E(x; \theta)}{\partial \theta}\right]$$
- Positive phase: corresponds to adjusting the model's parameters so that it assigns low energy (thus high probability) to observed data $x^{(i)}$.
- Negative phase: corresponds to ensuring that the overall probability mass does not blow up, by nudging the parameters so that the model distribution does not assign too much probability to non-data configurations. (The small enumeration sketch below makes these two expectations concrete for a tiny model.)
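Here is the promised sanity check, a toy setup I am inventing purely for illustration: for a 3-variable binary model with energy $E(x; W) = -x^\top W x$, the gradient of $\log Z$ with respect to a weight equals the model expectation $\mathbb{E}_{p}[x_i x_j]$, which we can verify against a finite-difference estimate because the model is small enough to enumerate.

import itertools
import numpy as np

def log_z(W):
    # Enumerate all 2^3 binary configurations and return log Z plus helpers
    states = np.array(list(itertools.product([0.0, 1.0], repeat=3)))
    energies = -np.einsum('ni,ij,nj->n', states, W, states)      # E(x) = -x^T W x
    return np.log(np.exp(-energies).sum()), states, energies

rng = np.random.default_rng(0)
W = 0.3 * rng.standard_normal((3, 3))

lz, states, energies = log_z(W)
p = np.exp(-energies - lz)                                       # normalized p(x)
model_expectation = np.einsum('n,ni,nj->ij', p, states, states)  # E_p[x_i x_j]

eps = 1e-5
W_plus = W.copy()
W_plus[0, 1] += eps
numeric = (log_z(W_plus)[0] - lz) / eps                          # d log Z / d W_01
print("analytic:", model_expectation[0, 1], "finite difference:", numeric)

The "negative phase" in the gradient is exactly this model expectation; the whole difficulty is that for realistic models we cannot enumerate it and must sample instead.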
Why the partition function term makes the gradient hard
The crux of the difficulty is that to compute $\mathbb{E}_{x \sim p(x; \theta)}\left[\partial E(x; \theta) / \partial \theta\right]$, one must sample from the current model distribution or otherwise approximate it. Direct summation or integration over $x$ is generally intractable, so we need to rely on methods such as Markov chain Monte Carlo (MCMC) to approximate this expectation. If the MCMC chain mixes slowly or the dimensionality is large, accurate sampling can become an expensive proposition.
The partition function's gradient thus stands out as the key computational burden. Indeed, the entire impetus behind many alternative objectives is to avoid explicit computation of $Z(\theta)$ or its gradient. As we shall see in subsequent chapters, approaches like Pseudolikelihood, Score Matching, and Noise-Contrastive Estimation are creative attempts to get around the intractable nature of the partition function while still capturing meaningful statistics about the data.
Interpreting the gradient of log Z
Another lens to see this from: $\log Z(\theta)$ can be interpreted in thermodynamic or statistical-physics terms as a normalizing factor ensuring that the "energy landscape" is consistent with a proper probability distribution. Its gradient effectively measures how changes in $\theta$ stretch or squash that landscape globally. Understanding this helps to see why we might approximate it locally (as in pseudolikelihood), or in a more global sense with approximate sampling (Contrastive Divergence), or by turning the problem into a classification problem (Noise-Contrastive Estimation).
3. Stochastic maximum likelihood and contrastive divergence
Historically, one of the early successes in dealing with the partition function within RBMs and other Boltzmann machines was the introduction of Contrastive Divergence (CD) by Hinton and subsequent refinements like Persistent Contrastive Divergence (PCD) or Stochastic Maximum Likelihood (SML). These methods revolve around MCMC sampling, but they do so in ways that reduce the computational cost and, crucially, the burden of having to approximate too precisely.
Naive MCMC approaches
Let us consider the naive approach to maximum likelihood. We know that the gradient has this negative-phase term requiring samples from the model distribution. One might try a straightforward Markov chain approach: for each gradient update, run a Markov chain from scratch (starting from a random initialization) to approximate the model distribution at the current $\theta$. But in many high-dimensional models, this chain could take a prohibitive number of steps to mix properly. Doing that at each parameter update is unrealistic. The result is that naive MCMC maximum likelihood can be extremely slow, often so slow as to be infeasible.
Contrastive divergence (CD-k)
Contrastive Divergence proposed a more practical compromise: rather than sampling from the model distribution until full mixing, one initializes the chain with the actual data (often called the "clamped" state) and then runs a small number of Gibbs sampling steps — $k$ steps, with $k = 1$ being typical — to get a "negative sample". The intuition is that starting from real data hopefully puts us in a region of high probability under the model, so we do not need as many steps to reach some approximate sample from $p(x; \theta)$. Then we use that sample to estimate the negative phase. Formally, the CD-k gradient update looks like:

$$\Delta \theta \propto -\frac{\partial E(x; \theta)}{\partial \theta} + \frac{\partial E(\tilde{x}_k; \theta)}{\partial \theta}$$

where $\tilde{x}_k$ is the sample obtained after $k$ steps of Gibbs sampling starting from $x$. This is sometimes referred to as a form of "short-run MCMC".
CD-k is fast, especially for an RBM, where Gibbs sampling alternates efficiently between updating all hidden units given the visible units and all visible units given the hidden units (each block update is a simple matrix operation). However, it introduces bias: the samples you get may not match the true stationary distribution of the model. Interestingly, for small $k$, the method often works well in practice (particularly in generative pretraining of deep networks). Yet for pure maximum likelihood, it is no longer guaranteed that we are performing gradient ascent on the true log-likelihood.
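To make the CD-$k$ update concrete, here is a minimal NumPy sketch for a Bernoulli RBM. The parameter names (W, b_v, b_h) and the plain SGD step are my own choices for illustration; this is a sketch of the alternating Gibbs steps and the positive/negative difference, not a faithful reproduction of any particular implementation.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd_k_update(W, b_v, b_h, v_data, k=1, lr=0.01, rng=None):
    # One CD-k parameter update for a Bernoulli RBM on a mini-batch v_data
    rng = rng or np.random.default_rng()

    # Positive phase: hidden probabilities with visibles clamped to the data
    ph_data = sigmoid(v_data @ W + b_h)

    # Negative phase: k alternating Gibbs steps starting from the data
    v_model = v_data
    for _ in range(k):
        ph = sigmoid(v_model @ W + b_h)
        h = (rng.random(ph.shape) < ph).astype(float)
        pv = sigmoid(h @ W.T + b_v)
        v_model = (rng.random(pv.shape) < pv).astype(float)
    ph_model = sigmoid(v_model @ W + b_h)

    # Gradient estimate: <v h>_data - <v h>_model, plus the bias terms
    batch = v_data.shape[0]
    dW = (v_data.T @ ph_data - v_model.T @ ph_model) / batch
    db_v = (v_data - v_model).mean(axis=0)
    db_h = (ph_data - ph_model).mean(axis=0)

    return W + lr * dW, b_v + lr * db_v, b_h + lr * db_h

Note that the negative-phase statistics come from $\tilde{x}_k$, the sample after only $k$ Gibbs steps, which is exactly where the bias discussed above enters.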
Stochastic maximum likelihood (SML) or persistent contrastive divergence (PCD)
A refinement to mitigate the short-run bias is to maintain a set of "persistent" MCMC chains across parameter updates, rather than re-initializing from the data each time. This approach is known as Persistent Contrastive Divergence or Stochastic Maximum Likelihood (SML). The idea is:
- Initialize one (or more) Markov chains at random once at the beginning.
- For each update, sample a mini-batch of data for the positive phase.
- Use the current state(s) of the persistent Markov chain(s) for the negative phase, running a few Gibbs steps on these chains to update them.
- Update parameters based on the difference between the positive-phase gradient and the negative-phase gradient.
- Keep the updated chain states around for the next parameter update.
Because these chains are not reset to the data at each iteration, they can potentially explore the model distribution more globally. SML is an (asymptotically) unbiased estimator of the gradient under certain conditions (e.g., if the chain is long-lived enough to approximate the stationary distribution). Still, the success of SML in practice depends heavily on the mixing properties of the chain and the complexity of the energy landscape.
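A sketch of the PCD/SML variant follows; it is deliberately almost identical to the CD-$k$ sketch above, with the single change that the negative-phase chain state is passed in, updated, and returned so that it persists across parameter updates (again, the names and the plain SGD step are illustrative assumptions).

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pcd_update(W, b_v, b_h, v_data, v_persistent, k=1, lr=0.01, rng=None):
    # One PCD/SML update: the negative chain continues from v_persistent
    # instead of being re-initialized at the data.
    rng = rng or np.random.default_rng()

    ph_data = sigmoid(v_data @ W + b_h)           # positive phase (data)

    v_model = v_persistent                        # persistent negative chains
    for _ in range(k):
        ph = sigmoid(v_model @ W + b_h)
        h = (rng.random(ph.shape) < ph).astype(float)
        pv = sigmoid(h @ W.T + b_v)
        v_model = (rng.random(pv.shape) < pv).astype(float)
    ph_model = sigmoid(v_model @ W + b_h)

    dW = v_data.T @ ph_data / len(v_data) - v_model.T @ ph_model / len(v_model)
    db_v = v_data.mean(axis=0) - v_model.mean(axis=0)
    db_h = ph_data.mean(axis=0) - ph_model.mean(axis=0)

    # Return the updated chain state so the caller feeds it back in next time
    return W + lr * dW, b_v + lr * db_v, b_h + lr * db_h, v_model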
Other related details
Many research papers have proposed improvements, from parallel tempering to advanced sampling methods like Annealed Importance Sampling (which we will see in a later section) to reduce bias further and speed up mixing. In general, though, if your model is high-dimensional and has complicated energy contours, even these advanced MCMC-based approaches can be slow or get stuck in local modes. This persistent difficulty is a major driver of alternative methods that avoid the partition function entirely.
4. Pseudolikelihood
Pseudolikelihood is a venerable approach introduced by Besag in the 1970s. It seeks to circumvent direct computation of $Z$ by replacing the joint likelihood with a product of conditional probabilities of each variable given all the others. Specifically, for a discrete variable set $x = (x_1, \ldots, x_d)$, the pseudolikelihood is:

$$PL(x; \theta) = \prod_{i=1}^{d} p(x_i \mid x_{-i}; \theta)$$

where $x_{-i}$ means "all variables except $x_i$". Taking the log of that product gives:

$$\log PL(x; \theta) = \sum_{i=1}^{d} \log p(x_i \mid x_{-i}; \theta)$$
Avoiding the partition function via conditional probabilities
Why does this help? Each conditional probability can often be computed without requiring the full $Z$. For instance, in an Ising model or a discrete MRF, one might have:

$$p(x_i \mid x_{-i}; \theta) = \frac{\exp\big(-E(x_i, x_{-i}; \theta)\big)}{\sum_{x_i'} \exp\big(-E(x_i', x_{-i}; \theta)\big)}$$

and that summation in the denominator is only over the local states of $x_i$ (e.g., $\{-1, +1\}$ in an Ising spin system, or a small discrete set). That is far easier to compute than the global $Z$, which sums over all variables simultaneously.
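Here is a small sketch of the log-pseudolikelihood for an Ising-style model with $\pm 1$ spins, couplings $J$ and fields $h$ (a toy parameterization chosen for illustration). Each term only requires normalizing over the two states of a single spin, so $Z$ never appears.

import numpy as np

def ising_log_pseudolikelihood(spins, J, h):
    # spins: array of +/-1 values, shape (d,)
    # J: symmetric coupling matrix with zero diagonal, shape (d, d)
    # h: external fields, shape (d,)
    # Each conditional p(x_i | x_-i) is a two-state softmax over x_i in {-1, +1}:
    # log p(x_i | x_-i) = x_i * f_i - log(exp(f_i) + exp(-f_i)), with f_i = (J x)_i + h_i
    fields = J @ spins + h
    return np.sum(spins * fields - np.logaddexp(fields, -fields))

rng = np.random.default_rng(0)
d = 20
J = 0.1 * rng.standard_normal((d, d))
J = (J + J.T) / 2
np.fill_diagonal(J, 0.0)
h = 0.05 * rng.standard_normal(d)
x = rng.choice([-1.0, 1.0], size=d)
print("log pseudolikelihood:", ising_log_pseudolikelihood(x, J, h))

Summing this quantity over a dataset and maximizing it (with autograd or hand-derived gradients) gives the pseudolikelihood estimate of $J$ and $h$ without ever touching the $2^d$-term partition function.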
Strengths of pseudolikelihood
One immediate benefit is that pseudolikelihood drastically cuts computation time for models with discrete local variables — especially if the local variable state space is small. The method also has theoretical guarantees in certain asymptotic regimes, meaning that it can yield consistent parameter estimates under some assumptions on the dependency structure. In some specialized tasks such as missing data imputation, optimizing conditional distributions may actually be more natural than focusing on the full joint distribution.
Limitations in density estimation
However, pseudolikelihood can be a poor approximation to the true log-likelihood in settings where global dependencies matter a great deal. It does not necessarily place the correct amount of mass on joint configurations if each variable's local conditionals do not adequately capture the multi-way interactions present in the data. Thus, for pure density modeling or generative tasks, pseudolikelihood might fail to produce samples that match the global structure of the data. In fields like image modeling or any domain with complex dependencies, pseudolikelihood can produce subpar generative performance even if the local conditionals are well fit.
Implementation details
In practice, one typically just replaces the negative log-likelihood with the negative log-pseudolikelihood and does standard optimization. Gradient-based methods remain feasible, and indeed are often simpler than MCMC-based maximum likelihood. Another use case is in certain forms of cross validation or hyperparameter tuning, where the pseudolikelihood can be computed quickly and used as a proxy objective in place of the intractable true likelihood.
5. Score matching and ratio matching
Score Matching, introduced by Hyvärinen (2005), is another strategy for models that have an intractable normalizing constant. The idea is elegantly different from pseudolikelihood. Instead of trying to maximize the log-likelihood directly, you match the gradient ("score") of the log-density with respect to $x$ between model and data. A key advantage is that it can be done without ever computing $Z$.
Basic principle of score matching
Let $\psi(x; \theta) = \nabla_x \log p(x; \theta)$ be the score function, i.e., the gradient of the log-probability with respect to the data $x$ (note that $\nabla_x \log p = \nabla_x \log \tilde{p}$, since $\log Z$ does not depend on $x$). Score matching tries to find $\theta$ so that:

$$J(\theta) = \frac{1}{2}\, \mathbb{E}_{x \sim p_{\text{data}}}\left[\big\| \psi(x; \theta) - \nabla_x \log p_{\text{data}}(x) \big\|^2\right]$$

is minimized, where $\nabla_x \log p_{\text{data}}(x)$ is the true gradient of the data distribution (in practice, we do not know this function, but the method sets up an alternative objective that is computable). Hyvärinen derived a neat trick: integration by parts turns this into an objective involving only the model's score and its derivatives,

$$J(\theta) = \mathbb{E}_{x \sim p_{\text{data}}}\left[\sum_{i} \left(\frac{\partial \psi_i(x; \theta)}{\partial x_i} + \frac{1}{2}\, \psi_i(x; \theta)^2\right)\right] + \text{const},$$

so the partition function disappears from the resulting formula. In short, the objective for score matching does not require $Z$.
Interpretation
Intuitively, if two distributions match in terms of the derivatives of their log-densities, they must be the same distribution (under suitable regularity conditions). So matching the score function is a proxy for matching the distribution itself. Score matching can be particularly elegant in continuous spaces, e.g., in modeling natural images or other real-valued data, provided we can define $\log \tilde{p}(x; \theta)$ in a differentiable manner.
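As a worked toy example (my own construction, not taken from the score matching literature), consider fitting the precision $\theta$ of an unnormalized zero-mean Gaussian, $\log \tilde{p}(x; \theta) = -\theta x^2 / 2$. The score is $\psi(x) = -\theta x$ and $\partial \psi / \partial x = -\theta$, so the objective above reduces to $J(\theta) = \mathbb{E}_{\text{data}}[-\theta + \tfrac{1}{2} \theta^2 x^2]$, whose minimizer is $\theta = 1 / \mathbb{E}[x^2]$, recovered without ever computing the Gaussian's normalizer.

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=2.0, size=100_000)    # true precision = 0.25

# Score-matching objective for this model: J(theta) = E[-theta + 0.5 * theta^2 * x^2]
def score_matching_objective(theta, x):
    return np.mean(-theta + 0.5 * theta**2 * x**2)

theta = 1.0
for _ in range(500):
    grad = np.mean(-1.0 + theta * data**2)             # dJ/dtheta
    theta -= 0.01 * grad

print("estimated precision:", theta)
print("closed-form minimizer 1 / E[x^2]:", 1.0 / np.mean(data**2))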
Incompatibility with discrete-variable models
A known drawback: standard score matching is not well-defined for discrete-variable models, because you cannot treat $\log p(x; \theta)$ as a smooth function of $x$ over a continuous domain. One can attempt modifications for discrete data, but this is less straightforward and can become considerably more complex. Indeed, methods like ratio matching have been proposed for certain discrete scenarios.
Ratio matching
Ratio matching tries to match ratios of probabilities between a data point and its neighboring configurations (for binary data, typically the neighbors obtained by flipping one variable at a time). As with score matching, the idea is to circumvent direct computation of the normalizing constant. Ratio matching sometimes finds use in specialized contexts, but it is far less mainstream than either pseudolikelihood or score matching. One might see it in cases where the data distribution is partially known or can be replaced with local empirical densities. For the majority of large-scale discrete tasks, though, methods such as Noise-Contrastive Estimation or Contrastive Divergence remain more common.
6. Denoising score matching
Denoising score matching extends the core idea of score matching by introducing an explicit noise model. It is deeply related to Denoising Autoencoders and has seen a recent surge of interest for tasks like generative modeling of images in the form of Denoising Diffusion Probabilistic Models.
Motivation for adding noise
One practical issue with score matching is that the model might overfit the data's local geometry, especially if there is insufficient coverage of the space. By adding noise to $x$ — essentially observing a perturbed version $\tilde{x}$ — you gain a smoothed version of the data distribution. The method trains the model to predict the score of the underlying clean distribution from the noisy samples. This also helps address regions of the space where the data is sparse, since you do not want your model to blow up arbitrarily in small pockets.
Mathematical form
The standard formulation, as introduced by Vincent in the context of denoising autoencoders, is that given a noisy sample $\tilde{x}$ created by adding a small Gaussian perturbation to $x$, the network (or model) tries to reconstruct the original $x$, or equivalently, to learn the gradient of the log of the clean data density. This can be cast as a form of score matching:

$$J_{\text{DSM}}(\theta) = \mathbb{E}_{x \sim p_{\text{data}},\; \tilde{x} \sim q(\tilde{x} \mid x)}\left[\frac{1}{2}\left\| \psi(\tilde{x}; \theta) - \nabla_{\tilde{x}} \log q(\tilde{x} \mid x) \right\|^2\right]$$

where for Gaussian corruption $\tilde{x} = x + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$, the target score is simply $\nabla_{\tilde{x}} \log q(\tilde{x} \mid x) = (x - \tilde{x}) / \sigma^2$. Often, $q(\tilde{x} \mid x)$ is a known Gaussian corruption process. The partition function of the underlying distribution still never enters explicitly, so we circumvent $Z$.
Relationship to autoencoder training
Denoising autoencoders similarly attempt to reconstruct the clean data from noisy input, effectively learning a latent representation that discerns the manifold of real data. The gradient-based viewpoint of denoising score matching unifies these ideas: in certain architectures, learning to denoise is analogous to learning local directions pointing back toward the data manifold, i.e., the negative gradient of the energy. This synergy has led to state-of-the-art generative modeling approaches, including score-based generative modeling in continuous time (also known as diffusion models) that unify the notion of iterative denoising with a continuous Markov process that can be reversed to generate new samples.
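To show how compact the training criterion becomes, here is a sketch of the denoising score matching loss with Gaussian corruption; score_fn stands for any differentiable model of $\nabla_x \log \tilde{p}(x; \theta)$, and I plug in the toy Gaussian score from the previous sketch purely as a placeholder.

import numpy as np

def dsm_loss(score_fn, x, sigma, rng=None):
    # Corrupt x with Gaussian noise, then regress the model score at the noisy
    # point onto the score of the corruption kernel, (x - x_noisy) / sigma^2.
    rng = rng or np.random.default_rng()
    x_noisy = x + sigma * rng.standard_normal(x.shape)
    target = (x - x_noisy) / sigma**2
    diff = score_fn(x_noisy) - target
    return 0.5 * np.mean(np.sum(diff**2, axis=-1))

theta = 0.25
score_fn = lambda x: -theta * x                  # score of log p~(x) = -theta x^2 / 2
rng = np.random.default_rng(0)
data = rng.normal(scale=2.0, size=(10_000, 1))
print("DSM loss:", dsm_loss(score_fn, data, sigma=0.1, rng=rng))

In a modern diffusion model, essentially this loss is evaluated at many noise levels, with score_fn replaced by a neural network conditioned on the noise level.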
7. Noise-contrastive estimation
Noise-Contrastive Estimation (NCE), introduced by Gutmann and Hyvärinen, is a technique that recasts unsupervised density estimation as a supervised binary classification problem. The partition function is learned as a parameter (often called the "bias" or offset) that helps discriminate between "real data samples" and "noise samples".
Reduction to binary classification
The key idea is: create a noise distribution $p_{\text{noise}}(x)$ that you can sample from easily, such as a uniform distribution or some other convenient distribution. Then combine real data samples (labeled $y = 1$) with noise samples (labeled $y = 0$) into a single dataset. Train a classifier that tries to predict whether a sample is real or noise by looking at $\log \tilde{p}(x; \theta)$ minus $\log p_{\text{noise}}(x)$. More concretely, define:

$$p(y = 1 \mid x) = \sigma\big(\log \tilde{p}(x; \theta) + c - \log p_{\text{noise}}(x)\big)$$

where $\sigma$ is the logistic sigmoid function. If the model is kept unnormalized as $\tilde{p}(x; \theta) = \exp\big(-E(x; \theta)\big)$, you can treat the offset $c \approx -\log Z$ as a parameter that is learned via gradient-based classification. Minimizing the cross-entropy classification loss with a balanced set of real and noise examples yields consistent estimates of both the model parameters and the partition function offset.
Simultaneous estimation of parameters and partition function
The partition function effectively shows up in the log-odds for distinguishing real from noise. If $c$ is a free parameter, gradient descent will adjust it so that the separation between real and noise samples is maximized. One advantage is that you never explicitly sum over the entire data space. Another is that we can choose $p_{\text{noise}}$ in a way that makes sampling trivial.
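Here is a deliberately small 1D sketch of NCE (entirely a toy of my own: an unnormalized Gaussian with unknown mean $\mu$ plus a learned offset $c$, discriminated against a broader Gaussian noise distribution). The point is that plain logistic-regression gradients recover both $\mu$ and $c \approx -\log Z = -\log\sqrt{2\pi}$.

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Model: log p~(x; mu) = -(x - mu)^2 / 2, with a learned normalization offset c
data = rng.normal(loc=1.5, scale=1.0, size=20_000)
noise_scale = 3.0
noise = rng.normal(loc=0.0, scale=noise_scale, size=20_000)

def log_noise(x):
    return -x**2 / (2 * noise_scale**2) - np.log(noise_scale * np.sqrt(2 * np.pi))

mu, c = 0.0, 0.0
lr = 0.05
for _ in range(5000):
    # Classifier logit G(x) = log p~(x; mu) + c - log p_noise(x)
    g_data = -(data - mu)**2 / 2 + c - log_noise(data)
    g_noise = -(noise - mu)**2 / 2 + c - log_noise(noise)

    # Gradients of the binary cross-entropy loss with respect to mu and c
    d_data = -(1.0 - sigmoid(g_data))        # real examples, label y = 1
    d_noise = sigmoid(g_noise)               # noise examples, label y = 0
    grad_mu = np.mean(d_data * (data - mu)) + np.mean(d_noise * (noise - mu))
    grad_c = np.mean(d_data) + np.mean(d_noise)

    mu -= lr * grad_mu
    c -= lr * grad_c

print("mu:", mu, "c:", c, "target -log Z:", -0.5 * np.log(2 * np.pi))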
Advantages and drawbacks
- Advantages: NCE can be an elegant approach when dealing with a large vocabulary but relatively simple single-variable or low-dimensional contexts, e.g., certain language modeling tasks. It's also straightforward to implement since it reuses standard binary classification routines.
- Drawbacks: In large multi-variable contexts with complex dependencies, choosing a suitable noise distribution can be non-trivial. If $p_{\text{noise}}$ is too simplistic, the classifier finds the discrimination task too easy, leading to poor density estimates. On the other hand, generating noise samples that mimic real data intricacies might defeat the purpose. Moreover, the approach typically scales poorly if the dimensionality or complexity of $x$ is high, unless you have a very clever or domain-specific noise model.
8. Estimating the partition function
Even though many methods circumvent direct computation of $Z$, there are still scenarios where we may want an actual estimate of it. For instance, when evaluating the absolute log-likelihood of a generative model, comparing models by their log-likelihoods on a validation set, or computing certain Bayesian model selection criteria, we need an estimate of $Z$. This final chapter surveys some of the widely used estimation techniques — mostly based on importance sampling variants and bridging methods.
Importance sampling
Importance sampling is a standard technique for estimating integrals or sums with respect to a difficult distribution by using samples drawn from an easier proposal distribution $q(x)$. Suppose we want:

$$Z = \sum_{x} \tilde{p}(x) \quad \text{or} \quad Z = \int \tilde{p}(x)\, dx$$

We can write:

$$Z = \sum_{x} q(x)\, \frac{\tilde{p}(x)}{q(x)} = \mathbb{E}_{x \sim q}\left[\frac{\tilde{p}(x)}{q(x)}\right]$$

Hence, if we sample $x^{(1)}, \ldots, x^{(K)}$ i.i.d. from $q$, an unbiased estimator is:

$$\hat{Z} = \frac{1}{K} \sum_{k=1}^{K} \frac{\tilde{p}(x^{(k)})}{q(x^{(k)})}$$

Choosing $q$ so that it resembles $\tilde{p}$ can reduce variance. But for high-dimensional distributions, finding or sampling from such a $q$ is not trivial. If $q$ does not adequately cover the important regions of $\tilde{p}$, the estimator can suffer from extremely high variance.
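A minimal numerical sketch (both densities chosen purely for illustration): estimate $Z = \int \tilde{p}(x)\, dx$ for the unnormalized 1D Gaussian $\tilde{p}(x) = \exp(-x^2/2)$, whose true normalizer is $\sqrt{2\pi} \approx 2.5066$, using a wider Gaussian proposal.

import numpy as np

rng = np.random.default_rng(0)
num_samples = 100_000
proposal_scale = 2.0

x = rng.normal(loc=0.0, scale=proposal_scale, size=num_samples)
log_p_tilde = -x**2 / 2                                          # unnormalized target
log_q = -x**2 / (2 * proposal_scale**2) - np.log(proposal_scale * np.sqrt(2 * np.pi))

weights = np.exp(log_p_tilde - log_q)                            # p~(x) / q(x)
print("Z estimate:", weights.mean(), "true value:", np.sqrt(2 * np.pi))

In one dimension almost any reasonable proposal works; the variance problems described above only bite once the dimensionality grows and the proposal stops overlapping with the target's typical set.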
Annealed importance sampling (AIS)
Annealed Importance Sampling, proposed by Neal (2001), is a significant improvement for evaluating $Z$. The method constructs a sequence of intermediate distributions bridging a tractable base distribution $p_0$ (e.g., a factorized Gaussian) and the target distribution $p_1 \propto \tilde{p}$. One typically uses a temperature parameter $\beta$ that goes from 0 to 1 in small steps:

$$\tilde{p}_{\beta_j}(x) = p_0(x)^{\,1 - \beta_j}\; \tilde{p}_1(x)^{\,\beta_j}$$

where $0 = \beta_0 < \beta_1 < \cdots < \beta_T = 1$. For each step, you do importance sampling or MCMC transitions to move from $\tilde{p}_{\beta_{j-1}}$ to $\tilde{p}_{\beta_j}$. The final product of importance weights across the chain yields an estimate of the ratio $Z_1 / Z_0$, where $Z_0$ is the known normalizer of the base distribution. AIS often yields lower-variance estimates than single-step importance sampling, given that the chain of intermediate bridging distributions allows a more incremental adaptation from the easy distribution to the difficult one.
Bridge sampling
Bridge sampling is a related idea in which you define a single "bridge" distribution that lies somewhere between the base and target distributions. One sets up an estimator for the ratio $Z_1 / Z_0$ that uses samples from both the base distribution and the target distribution, with a bridging function that helps reduce variance. AIS can be viewed as an extension of bridge sampling in which you use multiple bridging distributions in a chain rather than just one.
Linked importance sampling and variance reduction
Numerous variants exist, such as linked importance sampling, sequential Monte Carlo, and other advanced techniques that attempt to reduce variance by carefully correlating samples or controlling the shape of the bridging distributions. Each has its own set of pros and cons. The main takeaway is that if you truly need an accurate estimate of $Z$ (or $\log Z$), these bridging or annealing approaches usually perform better than naive importance sampling.
Practical considerations
- When to estimate $Z$: If your downstream task only depends on unnormalized probabilities or partial derivatives of the log-density, you might circumvent $Z$. Indeed, many modern generative modeling frameworks do not bother with explicit $Z$ estimation.
- Computational cost: AIS and related methods require carefully tuned MCMC steps and bridging schedules. They can be computationally expensive.
- Model comparison: In practice, if you want to compare two models by their log-likelihood, approximate partition function estimates can have non-negligible variance. Tools like PSRF (Potential Scale Reduction Factor) from MCMC diagnostics or repeated runs can help gauge reliability.

[Image: "Partition function schematic". A schematic illustration of how an unnormalized model distribution compares to a known base distribution; the partition function is the ratio between the integral under the unnormalized function and the integral under the base distribution, and bridging them is the essence of AIS.]
Below, I include a brief code snippet showing how one might implement a simplified AIS procedure in Python for a toy 2D energy function — just to illustrate the concept. Naturally, for real-world high-dimensional data, one must incorporate more sophisticated MCMC transition operators and schedules for $\beta$.
import numpy as np

def energy_function(x, y):
    # Simple ring-shaped energy: low energy near the unit circle
    r2 = x**2 + y**2
    return 0.1 * (r2 - 1.0)**2  # contrived energy for demonstration

def base_log_prob(x, y):
    # Log-density of the base distribution up to its normalizer:
    # isotropic Gaussian with mean 0 and unit variance
    return -(x**2 + y**2) / 2.0

def ais(num_samples=10000, num_steps=1000, step_size=0.1):
    # Draw initial samples exactly from the base distribution
    samples = np.random.randn(num_samples, 2)
    log_w = np.zeros(num_samples)

    # Geometric bridging: log f_beta(x) = (1 - beta) * log_base(x) + beta * (-E(x))
    betas = np.linspace(0.0, 1.0, num_steps)

    for i in range(num_steps - 1):
        beta0, beta1 = betas[i], betas[i + 1]

        # Importance-weight increment: (beta1 - beta0) * (log target - log base)
        e_vals = energy_function(samples[:, 0], samples[:, 1])
        log_base = base_log_prob(samples[:, 0], samples[:, 1])
        log_w += (beta1 - beta0) * (-e_vals - log_base)

        # One Metropolis step targeting the intermediate distribution at beta1
        # (very naive: a small Gaussian random-walk proposal)
        proposals = samples + step_size * np.random.randn(num_samples, 2)

        def neg_log_f(points):
            # Negative log of the (unnormalized) intermediate density at beta1
            e = energy_function(points[:, 0], points[:, 1])
            lb = base_log_prob(points[:, 0], points[:, 1])
            return beta1 * e - (1.0 - beta1) * lb

        current = neg_log_f(samples)
        proposed = neg_log_f(proposals)
        accept_prob = np.exp(np.minimum(0.0, -(proposed - current)))
        accepts = np.random.rand(num_samples) < accept_prob
        samples[accepts] = proposals[accepts]

    # Z_target ≈ mean(w) * Z_base; the unit-variance 2D Gaussian base has Z_base = 2*pi
    z_base = 2.0 * np.pi
    z_est = np.mean(np.exp(log_w)) * z_base
    return z_est

if __name__ == "__main__":
    z_est = ais(num_samples=20000, num_steps=500)
    print("Estimated partition function:", z_est)
This snippet is not optimized, but it should convey the essence: we start from an easy-to-sample distribution (the base), gradually anneal toward the target distribution, track the change in unnormalized log probabilities, and combine them into a single importance weight. The method ends up with an estimate of $Z$. Of course, for a real model, the energy function and base distribution would be replaced accordingly, and the MCMC steps would be carefully tuned.
This concludes our in-depth journey through the partition function, the difficulties it imposes, and the wide array of methods used to either approximate it or circumvent it altogether. I hope this article has shed some light on why the partition function merits its own dedicated discussion: from Contrastive Divergence to Noise-Contrastive Estimation to sophisticated AIS procedures for explicit computation, the partition function is the pivotal factor in many advanced probabilistic models. Understanding it, and understanding the techniques for dealing with it, is key to mastering energy-based modeling and a host of other undirected graphical models in modern machine learning.