

🎓 63/167
This post is part of the Probabilistic models & Bayesian methods educational series from my free course. Please keep in mind that the correct sequence of posts is outlined on the course page; the order in Research can be arbitrary.
I'm also happy to announce that I've started working on standalone paid courses, so you can support my work and get affordable educational material. These courses will be of a different level of quality, with more theoretical depth and a niche focus, and will feature challenging projects, quizzes, exercises, video lectures, and supplementary materials. Stay tuned!
As machine learning systems become more sophisticated and find applications in high-stakes domains — such as healthcare, financial forecasting, and autonomous driving — the need to quantify and manage uncertainty becomes increasingly paramount. Deterministic neural networks, which learn fixed parameters (weights and biases), offer powerful predictive capabilities but rarely provide insight into how confident these predictions are. In many real-world scenarios, risk assessment and reliability are as important as accuracy. For instance, a medical diagnosis model that outputs only a single class label ("benign" vs. "malignant") without a calibrated measure of uncertainty might lead to suboptimal or even dangerous decisions.
This is where probabilistic modeling enters the picture. By incorporating probability distributions directly into our models, we gain the ability to estimate uncertainty: we can learn not just a single best guess, but rather a distribution over plausible hypotheses. This extra information can lead to improved decisions under uncertainty and more robust modeling of complex phenomena. Instead of answering "What is the single most probable outcome?" a probabilistic model tries to capture "Which outcomes are likely, and how certain am I?"
From deterministic weights to probability distributions
Traditional feed-forward neural networks represent their learnable parameters (weights and biases) as single point values found by optimization — for example, via gradient descent on a loss function. In a Bayesian neural network (BNN), however, each parameter is endowed with a prior distribution. When we observe data, we use Bayesian inference to compute a posterior distribution over the parameters, reflecting the updated state of our knowledge about them. Crucially, this posterior distribution captures our uncertainty about parameters, taking into account the complexity of the model, the amount of data, and the noise inherent in the observations.
When we feed a new input into a BNN, we integrate over all plausible parameter values (weighted by their posterior probability) to produce the so-called predictive distribution. Rather than a single point prediction, we obtain a distribution that can give a measure of how likely each possible outcome is. This distribution also allows us to compute credible intervals, predictive intervals, or other measures of uncertainty.
Scope and structure
In this article, we will cover:
- Bayesian foundations: A thorough revisit of Bayes' theorem, the interplay of prior and likelihood in forming a posterior, and essential concepts like posterior predictive inference.
- Distinction between Bayesian networks and Bayesian neural networks: We will introduce Bayesian networks — directed acyclic graphs that encode conditional dependencies among variables — as well as neural networks that embed uncertainty over parameters.
- Building Bayesian neural networks: Practical aspects of constructing BNNs in frameworks like PyTorch and Pyro, focusing on how to place probability distributions over parameters.
- Posterior estimation: Methods to handle the typically intractable integrals that arise in Bayesian models — covering Markov chain Monte Carlo (MCMC) techniques, Hamiltonian Monte Carlo (HMC), No-U-Turn sampler (NUTS), as well as Variational Inference (VI).
- Advanced topics and best practices: Large-scale Bayesian networks, alternative approximate inference, network structure considerations such as d-separation and explaining away, etc.
- Practical uncertainty estimation: Comparing point-estimate NNs vs. BNNs, deep ensembles, Monte Carlo dropout, conformal prediction, plus calibration metrics and reliability diagrams.
By the end, you should have a solid foundation in how Bayesian networks in general — and Bayesian neural networks in particular — can improve model reliability by capturing uncertainty.
Historical context and frequentist vs. Bayesian views
While Bayesian methods date back centuries (originating in the work of Thomas Bayes, Pierre-Simon Laplace, and others), their popularity in modern machine learning has ebbed and flowed. Early AI approaches frequently used Bayesian reasoning to handle uncertainty in expert systems. With the advent of more data, frequentist methods and purely data-driven approaches like deep neural networks gained significant traction. However, as the demand for uncertainty estimation has grown, Bayesian approaches have re-emerged. Key references include the classic works on Bayesian belief networks (Pearl, 1988) and the pioneering Bayesian neural networks research from the early 1990s (Neal, 1996). More recent advances in scalable inference (e.g., variational inference and specialized MCMC variants) have made Bayesian neural networks more tractable for large datasets, spurring renewed research interest.
Frequentist and Bayesian approaches differ fundamentally in how they treat parameters and data: frequentists see parameters as fixed but unknown quantities, whereas Bayesians treat parameters as random variables with prior distributions. When new data is observed, Bayesians update those distributions according to Bayes' theorem. The rise of specialized Bayesian software, along with improved computational power, has lowered the barriers to building Bayesian models on large real-world problems.
Bayesian foundations
Revisiting Bayes' theorem
At the heart of Bayesian inference lies Bayes' theorem. If we denote $\theta$ as the parameters of a model and $\mathcal{D}$ as observed data, Bayes' theorem states:

$$p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})}$$

- $p(\theta)$ is the prior distribution, describing our beliefs (or assumptions) about $\theta$ before observing $\mathcal{D}$.
- $p(\mathcal{D} \mid \theta)$ is the likelihood, describing how probable it is to observe $\mathcal{D}$ if the parameters are $\theta$.
- $p(\theta \mid \mathcal{D})$ is the posterior distribution, encoding our updated belief about the parameters after observing data $\mathcal{D}$.
- $p(\mathcal{D})$ is the evidence or marginal likelihood, which acts as a normalizing constant ensuring that $p(\theta \mid \mathcal{D})$ is a valid distribution.
In practice, $p(\mathcal{D}) = \int p(\mathcal{D} \mid \theta)\, p(\theta)\, d\theta$ is often intractable to compute directly because it involves integrating over all possible parameter values. This difficulty motivates approximate inference methods such as MCMC and Variational Inference.
Prior, likelihood, posterior, and posterior predictive
Each component in the Bayesian pipeline has a specific role:
- Prior: Conveys domain knowledge or assumptions about $\theta$ before data is observed. For example, if we assume parameters are likely to be small, we might choose a zero-centered Gaussian prior.
- Likelihood: Specifies how the observed data is generated conditional on $\theta$. In a regression setting, we might assume Gaussian observation noise, whereas in classification we might use a Bernoulli or softmax likelihood.
- Posterior: Combines prior and likelihood to reflect new understanding after seeing data. Bayesian updating means shifting from prior beliefs to posterior beliefs in response to evidence.
- Posterior predictive distribution: To predict a new data point $x^*$, or its label $y^*$, we integrate over the posterior:

$$p(y^* \mid x^*, \mathcal{D}) = \int p(y^* \mid x^*, \theta)\, p(\theta \mid \mathcal{D})\, d\theta$$

This captures all parameter uncertainty when making predictions, unlike point estimates that use a single "best" set of parameters.
Bayesian networks vs. Bayesian neural networks
Bayesian networks (sometimes referred to as Bayesian belief networks or graphical models) are directed acyclic graphs (DAGs) whose nodes represent random variables and whose edges represent conditional dependencies. Formally, a Bayesian network is a graph $G = (V, E)$ with vertices $V$ and directed edges $E$, where an edge $X_i \to X_j$ indicates that $X_j$ depends on $X_i$. Each node has a conditional probability distribution that encodes how it depends on its parents in the graph. The joint distribution over all variables then factorizes according to the graph's structure (the chain rule factorization).
Bayesian neural networks share the same conceptual foundation of Bayesian inference — yet they place distributions specifically over the weights (and possibly biases) of a neural network. While a Bayesian network can be seen as a structured representation of conditional dependencies among random variables, a Bayesian neural network looks more like a "usual" neural network architecture for function approximation, but with priors on each weight parameter.
In short:
- Bayesian networks: explicit graph structure with nodes representing random variables and edges capturing direct dependencies. Commonly used in knowledge representation, causal reasoning, or hierarchical modeling tasks.
- Bayesian neural networks: standard NN architectures where each parameter is considered a random variable with a prior. The "graph" structure in a BNN is essentially the computational graph of the neural network rather than a DAG among observed and latent variables in the classical sense of a Bayesian network.
Despite these differences, both frameworks rely on the fundamental principle of Bayes' theorem to combine prior knowledge with observed data.
Chain rule factorization and conditional independence
A key property of Bayesian networks is that the joint distribution factorizes over the nodes:

$$p(X_1, \dots, X_n) = \prod_{i=1}^{n} p\big(X_i \mid \mathrm{Pa}(X_i)\big)$$

Here, $\mathrm{Pa}(X_i)$ indicates the parent nodes of $X_i$ in the DAG. This factorization is sometimes called the "chain rule for Bayesian networks." It significantly reduces the complexity of representing the joint distribution, especially under conditional independence assumptions encoded by the DAG.
Conditional independence is a powerful concept. If a variable $X$ is conditionally independent of $Y$ given $Z$, we have:

$$p(X \mid Y, Z) = p(X \mid Z)$$
Bayesian networks exploit these structured independencies to simplify inference. In a well-designed network, the presence or absence of edges strongly constrains the possible factorizations of the joint probability.
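To make the factorization concrete, here is a minimal sketch in plain Python of a hypothetical three-node rain/sprinkler/wet-grass network with made-up probability tables, showing how the joint distribution is assembled from local conditionals:

# Hypothetical network: Rain -> WetGrass <- Sprinkler
# Chain rule factorization: P(R, S, W) = P(R) * P(S) * P(W | R, S)
P_rain = {True: 0.2, False: 0.8}
P_sprinkler = {True: 0.1, False: 0.9}
P_wet_given = {(True, True): 0.99, (True, False): 0.90,
               (False, True): 0.85, (False, False): 0.05}

def joint(rain, sprinkler, wet):
    p_wet = P_wet_given[(rain, sprinkler)]
    return P_rain[rain] * P_sprinkler[sprinkler] * (p_wet if wet else 1 - p_wet)

# The eight entries of the joint sum to 1, as required of a valid distribution
total = sum(joint(r, s, w) for r in (True, False)
            for s in (True, False) for w in (True, False))
print(total)  # 1.0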
In neural networks, there is no explicit notion of a DAG for random variables in the same sense, but the parameter vector $\theta$ can be thought of as a set of random variables that generate predictions. Conditioned on the parameters, the outputs become deterministic (or follow some parametric likelihood). Still, BNNs can exhibit phenomena reminiscent of "explaining away," one of the hallmark behaviors in Bayesian networks.
Conjugate priors and Bayesian updating
A conjugate prior is a prior distribution that, when combined with a certain likelihood function, yields a posterior of the same family. This property greatly simplifies analytical updates. For example, a Beta prior combined with a Bernoulli likelihood yields a Beta posterior, or a Normal prior combined with a Normal likelihood on the mean yields a Normal posterior (with updated parameters). In real-world Bayesian neural networks, the interplay between weights and data is often too complex for neat conjugate forms, leading us to rely on approximate or numerical inference methods.
Bayesian updating is the process of taking a prior and arriving at the posterior by multiplying the prior by the likelihood of the data. Each new dataset can be folded in sequentially, refining beliefs as we go. In principle, this is straightforward, but in practice, integrals can be intractable, and posterior distributions can be high-dimensional and multimodal.
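As a minimal illustration of conjugate Bayesian updating, here is a sketch of sequential Beta-Bernoulli inference (the prior hyperparameters and the true coin bias are arbitrary choices for the example):

import numpy as np

a, b = 2.0, 2.0  # Beta(a, b) prior on the Bernoulli success probability

rng = np.random.default_rng(0)
data = rng.binomial(1, 0.7, size=50)  # synthetic coin flips with true p = 0.7

# Conjugacy: each observation updates the posterior in closed form
for x in data:
    a += x        # accumulate successes
    b += 1 - x    # accumulate failures

print(f"posterior mean = {a / (a + b):.3f}")  # approaches 0.7 as data accrues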
Common Bayesian pitfalls
Despite the conceptual clarity, Bayesian modeling can suffer from practical pitfalls:
- Improper priors: Overly vague or unbounded priors can yield posteriors that are not well-defined.
- Model misspecification: If the chosen likelihood or prior fails to capture the true data-generating process, the posterior might be systematically biased.
- Computational complexity: In high-dimensional parameter spaces — such as large neural networks — exact Bayesian inference is generally infeasible. Approximate methods may require significant computational resources.
These challenges underscore why Bayesian networks and Bayesian neural networks require careful design and robust approximations.
Key concepts of Bayesian neural networks
Representing weights and biases as probability distributions
In a Bayesian neural network, each weight and bias is assigned a probability distribution, e.g. a Gaussian with some mean and variance. Rather than storing a single numeric value for each parameter, we store a distribution that evolves as data is processed. Concretely, if $\theta = \{W_1, b_1, \dots, W_L, b_L\}$ represents the entire parameter set, we might place a prior factorized as:

$$p(\theta) = \prod_i p(\theta_i),$$

or a more general distribution that encodes complex dependencies among parameters. The final result is that the BNN no longer has a single feed-forward pass; predictions are integrated over the posterior distribution of parameters. We can sample a set of parameters from the posterior to get a distribution of predictions.
Estimating uncertainty and predictive distributions
The ultimate reason to go Bayesian is to estimate uncertainty in predictions. Suppose we want the probability of a new output $y^*$ given a new input $x^*$ and training data $\mathcal{D}$. In a BNN:

$$p(y^* \mid x^*, \mathcal{D}) = \int p(y^* \mid x^*, \theta)\, p(\theta \mid \mathcal{D})\, d\theta$$

Because $\theta$ is generally high-dimensional, we approximate the integral by Monte Carlo sampling or by deriving a tractable approximation such as variational inference. Each sample from $p(\theta \mid \mathcal{D})$ yields a different neural network instance, and averaging predictions over many draws provides an approximation to the predictive distribution. The variance of that distribution is a measure of epistemic uncertainty (i.e., model uncertainty), while any noise in the likelihood (e.g., Gaussian observation noise) reflects aleatoric uncertainty.
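A minimal sketch of that Monte Carlo approximation, assuming we already have posterior_samples (a list of parameter draws) and a hypothetical predict(theta, x) function implementing the network's forward pass:

import numpy as np

def predictive(x_new, posterior_samples, predict):
    # One forward pass per posterior draw; each draw is a different network
    preds = np.array([predict(theta, x_new) for theta in posterior_samples])
    mean = preds.mean(axis=0)      # Monte Carlo estimate of the predictive mean
    epistemic = preds.var(axis=0)  # spread across draws reflects model uncertainty
    return mean, epistemic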
Likelihood functions for regression and classification
In a regression task, it is common to assume that the observed outputs are drawn from a Normal distribution whose mean is given by the neural network's output $f_\theta(x)$, and whose variance is either fixed or also inferred:

$$y \mid x, \theta \sim \mathcal{N}\big(f_\theta(x), \sigma^2\big)$$

For classification, a Bernoulli or categorical/softmax likelihood is typical. For example, in binary classification:

$$y \mid x, \theta \sim \mathrm{Bernoulli}\big(\mathrm{sigmoid}(f_\theta(x))\big),$$

while in multi-class classification:

$$y \mid x, \theta \sim \mathrm{Categorical}\big(\mathrm{softmax}(f_\theta(x))\big)$$
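These likelihood choices map directly onto distribution objects. A sketch using torch.distributions, with f_out standing in for the network's raw output:

import torch
import torch.distributions as D

f_out = torch.tensor([0.3])  # hypothetical network output (mean or logit)

# Regression: Gaussian likelihood centered on the network output
log_lik_reg = D.Normal(f_out, 0.1).log_prob(torch.tensor([0.5]))

# Binary classification: Bernoulli on the logit
log_lik_bin = D.Bernoulli(logits=f_out).log_prob(torch.tensor([1.0]))

# Multi-class classification: Categorical on a vector of logits
logits = torch.tensor([0.2, 1.5, -0.3])
log_lik_cls = D.Categorical(logits=logits).log_prob(torch.tensor(1))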
Explaining away — connection to probabilistic graphical models
"Explaining away" is a phenomenon where the presence of one plausible cause for an observed effect can diminish the posterior probability of other potential causes. In Bayesian networks with multiple parent nodes pointing to a common child node, observing the child can introduce dependencies among parents that were previously independent. For instance, if a patient's fever can be caused by either the flu or food poisoning, once we know the patient definitely has the flu, the probability of food poisoning as a second cause may drop, even if initially they were considered independent causes of fever.
Bayesian neural networks can exhibit related behaviors: if multiple parameters can explain the same patterns in data, inferring a certain configuration might reduce the posterior probability of other configurations. Although the "graph" in a BNN is the architecture of the neural network, correlation structures among weights often lead to inter-causal or explaining-away effects.
Hyperparameter tuning and prior selection
In Bayesian neural networks, hyperparameters such as the prior variance control how "spread out" the parameter distributions are initially. If the prior is too narrow, the BNN might become overly confident and fail to capture the full range of plausible hypotheses; if too wide, the posterior might underfit the data or become multi-modal in ways that hinder sampling and optimization. Selecting priors often involves domain expertise — knowing whether parameters are likely to be large or small, or if certain layers require different constraints.
Common prior choices include:
- Gaussian (e.g., $\mathcal{N}(0, \sigma^2)$): The simplest, reflecting an assumption that parameters are near zero but can vary in either direction.
- Laplace (akin to L1-type regularization): Encourages sparsity.
- Hierarchical or structured priors: Introduce relationships among parameters, e.g. kernel-based or group-level priors.
Building a Bayesian neural network
Simulating data and problem setup
To illustrate how Bayesian neural networks work, one often starts with a synthetic regression or classification problem. For instance, one might generate a wiggly function with added noise, observed only in certain intervals, and then fit a BNN so that it generalizes well outside the observed region while faithfully expressing high uncertainty there.
You might do something like:
import numpy as np

def simulate_data_regression(num_points=200):
    # Noisy samples of a sine wave on [-1, 1]
    x = np.linspace(-1, 1, num_points)
    noise = 0.2 * np.random.randn(num_points)
    y = np.sin(2 * np.pi * x) + noise
    return x, y
In classification tasks, you can sample from known distributions or create toy examples (e.g., circles, spirals) to test how well a BNN captures complex decision boundaries.
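For instance, a sketch using scikit-learn's toy generators (assuming scikit-learn is installed):

from sklearn.datasets import make_circles, make_moons

# Two concentric circles: a boundary no linear model can capture
X_circ, y_circ = make_circles(n_samples=200, noise=0.05, factor=0.5, random_state=0)

# Interleaving half-moons: another classic nonlinear benchmark
X_moon, y_moon = make_moons(n_samples=200, noise=0.1, random_state=0)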
Model architecture: shallow vs. deep BNNs
A shallow BNN may have a single hidden layer with a small number of units, making it easier to demonstrate the inference process (such as MCMC sampling). Deeper models with multiple layers and more hidden units can capture richer function approximations but require more advanced or more computationally expensive inference methods.
- Shallow BNN: Often used in introductory tutorials to show how weights become distributions.
- Deep BNN: Potentially more expressive but also more challenging to train. Large-scale BNNs can require specialized approximations.
Gaussian priors on weights and biases
Arguably the most common prior assumption is the isotropic Gaussian prior:

$$W_{ij} \sim \mathcal{N}(0, \sigma^2) \quad \text{and} \quad b_j \sim \mathcal{N}(0, \sigma^2),$$

for some $\sigma$ controlling how wide the distribution is. This implies we expect parameters to be near zero unless the data strongly suggests otherwise. Simpler still, one might use $\mathcal{N}(0, 1)$ as a default, though in practice you might tune $\sigma$ or place a hyperprior on it.
Implementation details in PyTorch
Implementing BNNs in vanilla PyTorch can be done by manually specifying priors and performing MCMC or variational inference. However, you would need to write a fair amount of boilerplate code — managing distribution objects for each parameter, sampling them, computing the log probabilities, etc. This is a major reason for using high-level probabilistic programming frameworks such as Pyro or TensorFlow Probability.
If you do attempt it in pure PyTorch, you might:
- Initialize parameter tensors w and b with requires_grad=True.
- Define a log_prior(w, b) function that sums the log densities of the prior for each parameter.
- Define a log_likelihood(x, y, w, b) function that computes the log of $p(y \mid x, w, b)$.
- Combine them into log_posterior(w, b) = log_prior(w, b) + log_likelihood(x, y, w, b).
- Then run MCMC or VI updates to approximate the posterior (a minimal sketch follows below).
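A minimal sketch of those pieces for a single linear unit, assuming a standard Normal prior and fixed Gaussian observation noise:

import torch
import torch.distributions as D

w = torch.randn(1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

def log_prior(w, b):
    # N(0, 1) prior on every parameter, summed over dimensions
    prior = D.Normal(0.0, 1.0)
    return prior.log_prob(w).sum() + prior.log_prob(b).sum()

def log_likelihood(x, y, w, b, noise_std=0.1):
    # Gaussian likelihood around the (here, linear) predictor
    return D.Normal(w * x + b, noise_std).log_prob(y).sum()

def log_posterior(x, y, w, b):
    # Unnormalized: log p(w, b | D) = log p(w, b) + log p(D | w, b) + const
    return log_prior(w, b) + log_likelihood(x, y, w, b)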
Introduction to Pyro for Bayesian inference
Pyro is a probabilistic programming language built on PyTorch that automates much of the above. You specify a model function describing how data is generated, typically with calls like:
import pyro
import pyro.distributions as dist

def model(x, y):
    w = pyro.sample("w", dist.Normal(0., 1.))
    ...
    with pyro.plate("data", size_of_dataset):
        pyro.sample("obs", dist.Normal(...), obs=y)
You also specify a guide function if doing variational inference, or choose an MCMC kernel if using sampling approaches. Pyro then orchestrates the parameter updates or sampling procedures. Because it is integrated with PyTorch, it supports GPU-accelerated tensor operations, automatic differentiation, and sophisticated neural network modules.
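Fleshing out that skeleton, a sketch of a one-hidden-layer Bayesian regression model in Pyro; the layer width, tanh nonlinearity, and unit Gaussian priors are illustrative choices, not prescriptions:

import torch
import pyro
import pyro.distributions as dist

def model(x, y=None, hidden=10):
    # x: tensor of shape (n, 1); y: tensor of shape (n,) or None for prediction
    w1 = pyro.sample("w1", dist.Normal(0., 1.).expand([1, hidden]).to_event(2))
    b1 = pyro.sample("b1", dist.Normal(0., 1.).expand([hidden]).to_event(1))
    w2 = pyro.sample("w2", dist.Normal(0., 1.).expand([hidden, 1]).to_event(2))
    b2 = pyro.sample("b2", dist.Normal(0., 1.).expand([1]).to_event(1))
    sigma = pyro.sample("sigma", dist.HalfNormal(1.))  # observation noise scale

    mean = (torch.tanh(x @ w1 + b1) @ w2 + b2).squeeze(-1)
    with pyro.plate("data", x.shape[0]):
        pyro.sample("obs", dist.Normal(mean, sigma), obs=y)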
Practical tips for network initialization
When placing distributions over weights, initialization can matter. If the prior scale is large and the network is deep, forward passes can blow up easily, or gradient-based updates can become unstable. Common heuristics include:
- Setting prior means to zero or small random values.
- Setting prior variances (e.g., $\sigma^2$) according to the fan-in of each layer (a common heuristic scales $\sigma^2 \propto 1/\text{fan-in}$).
- Using smaller network architectures initially to debug inference procedures.
Posterior estimation methods
Markov chain Monte Carlo (covered before)
MCMC is a family of algorithms for sampling from complex, high-dimensional distributions — such as the posterior $p(\theta \mid \mathcal{D})$. The idea is to construct a Markov chain whose stationary distribution is the desired posterior. Common MCMC approaches used for BNNs include Metropolis-Hastings, Hamiltonian Monte Carlo, and the No-U-Turn Sampler.
While MCMC can provide asymptotically exact samples (given enough time), it can be slow to converge and scale poorly to huge datasets or very deep networks. Techniques like mini-batching are more complicated with MCMC but are possible in some specialized forms of stochastic gradient MCMC.
Hamiltonian Monte Carlo
Hamiltonian Monte Carlo (HMC) uses gradient information of the log-posterior to guide proposals in parameter space. Think of $\theta$ as a particle moving in a potential energy landscape defined by the negative log-posterior. By simulating the Hamiltonian dynamics, HMC can perform larger, more informed jumps through parameter space, often reducing the random-walk behavior that plagues vanilla Metropolis-Hastings.
In Pyro or Stan, HMC is implemented through methods that automatically compute gradients with respect to the parameters. However, HMC can still be quite computationally expensive for large networks.
No-U-Turn sampler (NUTS)
The No-U-Turn sampler is an extension of HMC that eliminates the need to hand-tune the trajectory length (the number of leapfrog steps). NUTS adaptively chooses when to stop the trajectory so that it does not "turn back" on itself, automating a crucial hyperparameter. This makes HMC more efficient, especially in high dimensions.
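In Pyro, running NUTS on a model like the sketch above takes only a few lines (sample counts are illustrative, and x_train, y_train are assumed to be prepared tensors):

from pyro.infer import MCMC, NUTS

nuts_kernel = NUTS(model)  # step size and trajectory length tuned automatically
mcmc = MCMC(nuts_kernel, num_samples=500, warmup_steps=200)
mcmc.run(x_train, y_train)
posterior_samples = mcmc.get_samples()  # dict: site name -> tensor of draws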
Diagnosing convergence
When using MCMC, we must ensure the chain has converged to a stationary distribution. Common diagnostics include:
- Trace plots: Visual inspection of parameter samples over iterations.
- Gelman–Rubin statistic ($\hat{R}$): Compares variance between multiple chains to variance within each chain. If $\hat{R} \approx 1$ (a common rule of thumb is $\hat{R} < 1.1$), the chains are likely converged.
- Effective sample size: Measures how many effectively independent samples are obtained, accounting for autocorrelation.
If the chain is not mixing well, we might see poor effective sample sizes and $\hat{R}$ far from 1.
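As a sketch, the split-$\hat{R}$ statistic can be computed by hand from a (chains × draws) array of samples for a single parameter:

import numpy as np

def r_hat(samples):
    # samples: shape (n_chains, n_draws) for one parameter;
    # each chain is split in half so within-chain drift also inflates R-hat
    half = samples.shape[1] // 2
    chains = np.concatenate([samples[:, :half], samples[:, half:2 * half]], axis=0)
    n = chains.shape[1]
    between = n * chains.mean(axis=1).var(ddof=1)  # variance of chain means
    within = chains.var(axis=1, ddof=1).mean()     # average within-chain variance
    var_plus = (n - 1) / n * within + between / n
    return np.sqrt(var_plus / within)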
Variational inference (VI)
Variational inference offers a deterministic alternative to MCMC by reframing the inference problem as an optimization task. We choose a family of tractable distributions $q_\phi(\theta)$, typically factorized, then find the parameters $\phi$ that minimize the KL divergence $\mathrm{KL}\big(q_\phi(\theta) \,\|\, p(\theta \mid \mathcal{D})\big)$. Because we cannot compute $p(\theta \mid \mathcal{D})$ directly, we instead maximize the Evidence Lower BOund (ELBO):

$$\mathrm{ELBO}(\phi) = \mathbb{E}_{q_\phi(\theta)}\big[\log p(\mathcal{D} \mid \theta)\big] - \mathrm{KL}\big(q_\phi(\theta) \,\|\, p(\theta)\big)$$

This approach can scale to large datasets using stochastic gradient-based optimizers, but the approximation depends on the flexibility of the chosen family $q_\phi$.
Mean-field variational inference
Mean-field VI is the simplest variant, where $q_\phi(\theta)$ factorizes across parameters. For instance:

$$q_\phi(\theta) = \prod_i q_{\phi_i}(\theta_i)$$

Each $\theta_i$ might have a distinct Gaussian distribution parameterized by a mean and variance. While computationally convenient, mean-field approximations can underrepresent correlations among parameters, possibly leading to an overconfident posterior.
Stochastic gradient and the ELBO
Variational inference typically relies on gradient-based optimization. We can write:

$$\nabla_\phi\, \mathrm{ELBO}(\phi) = \nabla_\phi\, \mathbb{E}_{q_\phi(\theta)}\big[\log p(\mathcal{D}, \theta) - \log q_\phi(\theta)\big]$$

Using the "reparameterization trick" or other gradient estimators, we approximate the expectation by drawing samples $\theta^{(s)} \sim q_\phi(\theta)$. Because each iteration is typically much faster than a full MCMC iteration, VI can handle bigger models more easily.
AutoDiagonalNormal and other Pyro guides
Pyro provides convenient "auto-guide" classes that automatically create a parametric family $q_\phi(\theta)$. For instance, AutoDiagonalNormal places an independent Gaussian distribution on each parameter dimension. More advanced guides exist, e.g. AutoMultivariateNormal, normalizing flows, or hierarchical structures, that can better capture correlations.
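A sketch of mean-field SVI with an auto-guide, reusing the model from the sketch above (the learning rate and step count are illustrative):

from pyro.infer import SVI, Trace_ELBO
from pyro.infer.autoguide import AutoDiagonalNormal
from pyro.optim import Adam

guide = AutoDiagonalNormal(model)  # one independent Gaussian per parameter dimension
svi = SVI(model, guide, Adam({"lr": 1e-2}), loss=Trace_ELBO())

for step in range(2000):
    loss = svi.step(x_train, y_train)  # one stochastic ELBO gradient step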
Comparing MCMC and VI in practice
- MCMC: Potentially more accurate asymptotically; can approximate multi-modal posteriors. But can be slow, and difficult to scale.
- VI: Typically faster and more scalable, especially for large models and datasets. But can yield biased or too "simple" approximations if the variational family is not expressive enough.
Many researchers use whichever method is more tractable or whichever best matches their computational constraints. Hybrid approaches, or sophisticated flow-based variational distributions, can narrow the gap.
Updating the posterior with new observations
In principle, we can treat newly arrived data $\mathcal{D}_2$ as a second inference step:

$$p(\theta \mid \mathcal{D}_1, \mathcal{D}_2) \propto p(\mathcal{D}_2 \mid \theta)\, p(\theta \mid \mathcal{D}_1)$$
This is straightforward conceptually, but not always easy in practice if the prior or posterior is complex. For MCMC, we could continue sampling with the updated likelihood. For VI, we can initialize a new variational distribution from the old posterior's parameters and continue optimizing with new data. This is sometimes referred to as Bayesian updating or online Bayesian learning.
Practical uncertainty estimation
Comparing point estimate NNs and BNNs
A point estimate neural network uses a single set of weights found by (for example) maximum likelihood or maximum a posteriori. If you plot predictions, the model may look extremely certain even in regions where there is little or no data. A Bayesian neural network, by contrast, typically shows high predictive uncertainty in data-scarce regions, reflecting limited information about the correct parameter settings.
Deep ensembles: approximate multi-modal posteriors
Deep ensembles (Lakshminarayanan et al., 2017) train multiple independent neural networks from random initializations or different data folds. The ensemble average can mimic a Bayesian posterior by capturing multiple modes in parameter space, though it is not strictly a Bayesian procedure. Nevertheless, deep ensembles often yield impressive uncertainty estimates in practice, can be simpler to implement than BNN-specific methods, and scale well with modern hardware.
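A sketch of the ensemble recipe at prediction time, assuming a hypothetical train_network(seed) helper that trains one member:

import torch

def ensemble_predict(x, members):
    # Average class probabilities across independently trained networks
    probs = torch.stack([torch.softmax(m(x), dim=-1) for m in members])
    mean = probs.mean(dim=0)         # ensemble predictive distribution
    disagreement = probs.var(dim=0)  # spread across members ~ epistemic uncertainty
    return mean, disagreement

# members = [train_network(seed=s) for s in range(5)]  # hypothetical helper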
Monte Carlo dropout as a Bayesian approximation
Monte Carlo (MC) dropout (Gal & Ghahramani, 2016) interprets dropout at test time as sampling from an approximate posterior over weights. By leaving dropout layers active, each forward pass yields a different "thinned" network. Repeating multiple forward passes and averaging yields a predictive distribution. This method is easy to implement (simply do not disable dropout at test time) and can produce well-calibrated uncertainties in some cases, though it might not be as powerful as a full Bayesian approach or as stable as a carefully tuned ensemble.
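A sketch of MC dropout at prediction time in PyTorch: put the network in eval mode, then re-enable only the dropout layers and average many stochastic forward passes:

import torch

@torch.no_grad()
def mc_dropout_predict(model, x, n_samples=50):
    model.eval()
    for m in model.modules():
        if isinstance(m, torch.nn.Dropout):
            m.train()  # keep dropout stochastic at test time
    preds = torch.stack([model(x) for _ in range(n_samples)])
    return preds.mean(dim=0), preds.var(dim=0)  # predictive mean and variance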
Conformal prediction: theory and usage
Conformal prediction (Vovk et al.) is a frequentist-driven approach to constructing prediction intervals or sets, guaranteeing certain coverage properties under mild assumptions. Unlike BNNs or ensembles, conformal prediction does not require changing the training procedure itself. Instead, it uses a held-out calibration set to compute a "nonconformity score," thereby building a set or interval for new observations guaranteed to have coverage $1 - \alpha$ (marginal coverage). This approach can be combined with any predictive model — Bayesian or not — to produce intervals that are valid in finite samples (assuming exchangeability).
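A sketch of split conformal prediction for regression, using absolute residuals on the calibration set as the nonconformity score (predict stands for any fitted model's prediction function):

import numpy as np

def conformal_interval(predict, x_cal, y_cal, x_new, alpha=0.1):
    scores = np.abs(y_cal - predict(x_cal))  # nonconformity: absolute residuals
    n = len(scores)
    # Finite-sample-corrected quantile yields (1 - alpha) marginal coverage
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    q = np.quantile(scores, level)
    pred = predict(x_new)
    return pred - q, pred + q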
Trade-offs in computational cost and performance
- BNNs can yield rich posteriors but can be expensive to train via MCMC or advanced VI.
- Deep ensembles can be trivially parallelized by training multiple networks, but require additional memory.
- MC dropout is easy to incorporate but might degrade raw performance if dropout significantly alters the training dynamics.
- Conformal approaches are model-agnostic but require separate calibration steps and might produce intervals that fail to capture some structural uncertainties.
Calibration and reliability diagrams
A well-calibrated model has the property that its predicted probabilities match empirical frequencies. For instance, among all predictions assigned a 70% probability of being correct, roughly 70% should be correct. Reliability diagrams plot predicted probability against empirical accuracy. Many Bayesian methods do not guarantee perfect calibration out-of-the-box, but in practice, they tend to calibrate better than purely deterministic point-estimate networks. Techniques like temperature scaling can further refine calibration of predictive distributions.
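A sketch of the expected calibration error (ECE) with equal-width confidence bins, given arrays of predicted confidences and 0/1 correctness indicators:

import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    # Bin by confidence; compare mean confidence to empirical accuracy per bin
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # weight gap by bin occupancy
    return ece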
Advanced topics
d-separation and active trails
In classical Bayesian networks (graphical models), d-separation is a criterion that tells us whether a set of observed variables "blocks" every path between two unobserved variables, thereby implying conditional independence. If there is no active trail (path) between two variables given the evidence, then those variables are conditionally independent given that evidence. Specifically:
$X$ and $Y$ are said to be $d$-separated by $Z$ if, in the graph $G$, every path from $X$ to $Y$ is blocked by $Z$.
For example, consider the so-called "V-structure," $X \to Z \leftarrow Y$: here, $X$ and $Y$ are marginally independent, but they become dependent once $Z$ is observed — this is the classic "explaining away" phenomenon.
While BNNs do not typically frame their dependencies through explicit DAGs of observed variables, the concept of partial correlation among parameters is conceptually related to whether certain sets of parameters "block" or "activate" dependencies within the network's representation of data.
Large-scale Bayesian neural networks
Scaling BNNs to massive architectures — e.g., modern convolutional or transformer networks — remains an area of active research. Naive MCMC can become infeasible for extremely large networks. Variational inference, especially with structured or flow-based approximate posteriors, is more promising at large scales. Another approach is to adopt a hybrid: use a deterministic backbone for most layers and only treat certain layers or subsets of parameters as Bayesian.
Complex prior distributions, e.g., hierarchical priors
We are not constrained to isotropic Gaussian priors. We can design structured priors that encourage correlations among parameters — for instance, a hierarchical prior for weight matrices that share patterns across different layers or channels. Such priors can lead to better uncertainty estimates and can incorporate domain knowledge (e.g., images have spatial correlation, wavelet coefficients might have sparse structure, etc.).
Alternative approximate inference: normalizing flows, etc.
Variational distributions can be made more flexible by using normalizing flows or invertible transformations that map simple base distributions (like Gaussian) into more complicated shapes. This can capture multi-modality or heavy tails in the posterior. Flow-based VI can approximate posteriors more accurately than simple mean-field approaches, albeit at higher computational cost.
Explaining away and inter-causal reasoning
We have mentioned "explaining away" in the context of Bayesian networks, but it can also manifest in Bayesian neural networks. When multiple sets of parameters can explain the data, observing the data can cause the posterior mass to concentrate more heavily on one set of parameters, decreasing probability assigned to alternative sets. This can be viewed as inter-causal reasoning: the presence of one cause (set of parameters) makes the other less necessary to explain the effect (the observed data).
Transfer learning and Bayesian fine-tuning
In many deep learning applications, it is common to start with a model pretrained on a large dataset and then fine-tune it on a smaller target dataset. Bayesian approaches can incorporate uncertainty from the pretrained model by using its weights as a prior or by adopting some hierarchical structure that captures how the new data updates the old parameters. Bayesian fine-tuning can lead to robust adaptation, especially when target data is limited.
Additional implementation frameworks and best practices
JAX, TensorFlow Probability, and others
Beyond PyTorch + Pyro, other frameworks offer probabilistic programming or Bayesian neural network capabilities:
- TensorFlow Probability (TFP): Tools for building Bayesian models and performing VI or MCMC with TensorFlow.
- JAX-based libraries: Haiku, Flax, NumPyro, and other ecosystems that combine JAX's auto-differentiation with probabilistic tools.
- Stan: A powerful probabilistic language mostly used for classical Bayesian models, though sometimes used for smaller Bayesian NNs.
- Edward2: An experimental interface for TFP with higher-level constructs for Bayesian neural networks.
Optimization tricks and debugging tips
- Gradual unfreezing: Sometimes it helps to fix certain parameters, then unfreeze them as the inference progresses.
- Learning rate schedules: Because we are optimizing an ELBO or running an MCMC chain, the step size can have a big impact. In HMC, a too-large step size leads to high rejection rates; in VI, it can cause divergence.
- Checking variance: Keep an eye on the scale of parameter distributions. If they explode, you may need to reduce the prior variance or re-initialize.
- Correlated parameters: For advanced networks, consider more expressive approximate posteriors that capture correlations among weights.
Model selection and comparison
Selecting among Bayesian models can be done by comparing marginal likelihoods or approximate model evidence, though this is often computationally difficult. Alternatives include:
- Widely Applicable Information Criterion (WAIC): A generalization of AIC and DIC for Bayesian models.
- Bayes factors: The ratio of marginal likelihoods for two models.
- Predictive performance: In practice, many just compare predictive metrics like RMSE, log-likelihood, or calibration error on a validation set.
Reproducibility and experiment tracking
Due to the inherent stochasticity of sampling-based approaches, it is crucial to:
- Use fixed random seeds (though note that some GPU computations might be nondeterministic).
- Log MCMC traces and diagnostic statistics.
- Track hyperparameters of priors and inference algorithms meticulously.
- Store final posterior samples or fitted variational distributions for later inspection and reproducibility.
Applications and case studies
Regression tasks — time series, noisy function approximation
Bayesian neural networks are especially useful in regression tasks with limited data or high uncertainty. For time series forecasting, a BNN can produce credible intervals that expand as we forecast further into the future — reflecting the accumulation of uncertainty over time. In noisy function approximation (e.g., modeling physical processes), the BNN can separate measurement noise (aleatoric) from model uncertainty (epistemic).
Classification tasks — MNIST, distribution shift detection
In classification, BNNs can flag out-of-distribution inputs by showing large predictive uncertainty. For instance, on MNIST digit classification, a BNN can produce high-entropy predictions for images that do not look like typical handwritten digits (e.g., random noise or letters). This is beneficial for real-world applications that must detect anomalies or reject uncertain predictions.
Medical, financial, and other real-world applications
Clinical diagnosis must often account for the costs of false positives and false negatives. A Bayesian model can incorporate domain knowledge about disease prevalence (the prior) and provide well-calibrated posteriors that reflect how uncertain it is about a diagnosis — imperative for medical decision making. In finance, capturing uncertainty about future market behavior can help risk management. In robotics or self-driving cars, Bayesian methods can quantify uncertainty in sensor readings and thereby help reduce collisions and planning errors.
Interpreting and visualizing uncertainty
Visualizing the posterior predictive distribution often involves plotting a mean prediction plus credible intervals (e.g., ±2 standard deviations). One can also visualize the distribution of network parameters or the distribution of predictions on a test set. Tools such as reliability diagrams help check calibration, while dimension-reduced embeddings of posterior samples can hint at multi-modal distributions.
Performance metrics in real-world scenarios
When uncertainty matters, standard metrics like accuracy or MSE are insufficient. We might consider:
- Brier score: A proper score that measures the accuracy of probabilistic predictions.
- Log-likelihood / Log probability: Summation or average of $\log p(y_i \mid x_i, \mathcal{D})$ over test points.
- Calibration error: e.g. Expected Calibration Error (ECE) or reliability diagrams.
- Coverage: For intervals or sets, what fraction of true data is covered by the predicted intervals/sets?
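As a sketch, two of these metrics for a batch of probabilistic predictions (binary classification and interval coverage, respectively):

import numpy as np

def brier_score(probs, labels):
    # Mean squared error between predicted probability and the 0/1 outcome
    return np.mean((probs - labels) ** 2)

def interval_coverage(lower, upper, y_true):
    # Fraction of true values that fall inside their predicted intervals
    return np.mean((y_true >= lower) & (y_true <= upper))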
Conclusion
Key takeaways and lessons learned
- Uncertainty matters: Bayesian frameworks provide a systematic way to incorporate and update uncertainties, crucial for risk-sensitive domains.
- Bayesian networks: Encode the factorization of a joint distribution in a DAG, capturing conditional independencies and enabling structured reasoning about latent and observed variables.
- Bayesian neural networks: Extend neural networks by placing distributions over parameters, yielding powerful function approximators that reflect uncertainty in their predictions.
- Inference: Exact Bayesian inference is typically intractable for high-dimensional models, but approximate methods like MCMC and variational inference provide practical solutions — each with trade-offs in computational cost, accuracy, and complexity.
Open challenges and future research directions
- Scalability: MCMC for large models remains challenging, though specialized methods continue to appear.
- Expressive approximate posteriors: Flow-based or implicit distributions can capture richer posterior structures but are computationally intensive.
- Automated prior specification: Deciding "good" priors can be nontrivial, especially for very deep networks with tens of millions of parameters.
- Multi-modal distributions: Real posteriors in deep models may be multi-modal. Handling these systematically remains an open research area.
- Integration with big data: Stochastic gradient MCMC and distributed inference are areas of active research for data at web scale.
References and recommended reading
Below are several sources for further exploration. In addition, many references are mentioned inline throughout the text.
- D. J. C. MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003.
- R. M. Neal. Bayesian Learning for Neural Networks. Springer, 1996.
- Y. Gal, Z. Ghahramani. "Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning." ICML, 2016.
- C. Robert. The Bayesian Choice. 2nd ed. Springer, 2001.
- A. Kendall, Y. Gal. "What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?" NeurIPS, 2017.
- B. Lakshminarayanan, A. Pritzel, C. Blundell. "Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles." NeurIPS, 2017.
- Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.
- Andrew D. Gordon, Thomas A. Henzinger, Aditya V. Nori, Sriram K. Rajamani. "Probabilistic programming." FOSE 2014.
Final thoughts on the Bayesian perspective
Bayesian neural networks and Bayesian networks elegantly combine foundational probability theory with modern machine learning. They empower practitioners to encode prior knowledge, rigorously update beliefs in light of data, and reason about uncertainty for safer and more interpretable AI. While there are still computational and conceptual hurdles, the field is rapidly evolving, and the fundamental ideas — grounded in Bayes' theorem — remain as relevant as ever for robust, trustworthy machine learning.
Having walked through the motivations, mathematical foundations, computational methods, and practical issues, you now possess an extensive overview of Bayesian networks and Bayesian neural networks, including how to build them, how to approximate posteriors, and how to interpret the resulting uncertainty. In the broader machine learning landscape, these ideas represent a crucial step forward in developing models that both fit data and acknowledge what they do not know.