Graphical models

Graphical models

This is just the beginning of the real ML

#️⃣   ⌛  ~1.5 h 📚  Advanced

25.01.2025

upd:

#146

Graphical models

This is just the beginning of the real ML

⌛  ~1.5 h

#146

🎓 62/167

This post is a part of the Probabilistic models & Bayesian methods educational series from my free course. Please keep in mind that the correct sequence of posts is outlined on the course page, while it can be arbitrary in Research.

I'm also happy to announce that I've started working on standalone paid courses, so you could support my work and get cheap educational material. These courses will be of completely different quality, with more theoretical depth and niche focus, and will feature challenging projects, quizzes, exercises, video lectures and supplementary stuff. Stay tuned!

Graphical models are a powerful family of probabilistic models that represent complex dependencies among sets of random variables using graphs, where the nodes typically correspond to these random variables, and the edges (or lack of edges) capture conditional dependencies and independencies. The essential insight is that, rather than trying to reason about an unwieldy joint probability distribution directly, one can decompose it into simpler factors that correspond to substructures in the graph. This decomposition often provides a more intuitive understanding of any associations among the variables, as well as a computational framework for inference.

We say a "graphical model" because the structure of the graph encodes crucial information about how variables interact. If two variables do not share a direct connection (an edge), then there is a notion of conditional independence given the rest of the variables in the model. Hence, these graphical structures help dissect high-dimensional distributions into manageable components. Graphical models appear across many domains, including speech recognition, computer vision, natural language processing, and computational biology. They provide a unifying language that fosters synergy with other subfields in machine learning and artificial intelligence.

A recognized strength of graphical models is that they are designed to work gracefully in settings with partial observability, missing data, or uncertain evidence. Because they rest on formal probability theory, they accommodate both parameter estimation and inference about unobserved variables systematically. As the graphs become more complicated, so do the interactions among variables, but the underlying theory ensures we can derive approximate or exact inference procedures. Some approaches are more tractable than others, and for sufficiently large or complex graphs, approximation methods become necessary.

The ability to model a rich array of relationships sets graphical models apart from simpler, less flexible models. With the right structure, we may inform the model that certain variables depend on each other in particular ways, but not on others. This leads to a more parsimonious model by reducing the number of parameters needed to describe the data. From a machine learning perspective, this often improves interpretability. For instance, one might design a graphical model in a hierarchical form, effectively capturing domain-specific knowledge, such as a disease-outbreak model that reflects known epidemiological pathways of transmission.

Moreover, graphical models unify what might otherwise be seen as disparate approaches: Bayesian networks fall under directed graphical models, Markov random fields reflect undirected graphical structures, and factor graphs generalize both by introducing a bipartite architecture connecting factors and variables. As such, an appreciation of graphical models grants deeper insight into a broad variety of machine learning and data-analysis algorithms.

1.2 historical context

The modern history of graphical models is often traced back to several distinct fields converging on the same fundamental ideas. Early developments in statistical physics (particularly the Ising model and Markov random fields), in the mid-20th century, paved the way for using undirected graphs to represent systems of interacting variables. In artificial intelligence, methods for knowledge representation blossomed in the 1970s and 1980s, advancing graphical structures like semantic networks or frames that partially overlapped with the concept of directed graphs for representing conditional dependence.

During the 1980s, Judea Pearl's foundational work in probabilistic reasoning fused these strands in a coherent framework that we now interpret as Bayesian networks. This was a major milestone, since it demonstrated how uncertainty and knowledge could be systematically combined with conditional independence assumptions to yield efficient inference algorithms. Pearl's d-separation concept revolutionized how researchers reasoned about conditional independence in directed acyclic graphs. Meanwhile, the Hammersley-Clifford theorem formalized the notion of representing multivariate distributions through products of potential functions defined on cliques of undirected graphs, anchoring the development of Markov random fields.

By the 1990s, graphical models had become a major force in machine learning, particularly through the introduction of the junction tree algorithm, belief propagation, and the development of factor graphs. The synergy between these ideas and classical state-space models—like hidden Markov models for speech recognition—spurred broader adoption, illustrating how chain-structured graphs enable efficient inference. As major conferences such as NeurIPS and ICML grew, so too did the community's attention to advanced inference algorithms, approximate methods for large-scale datasets, and integrative approaches that merged graphical models with neural architectures.

This growth and cross-pollination has continued through the 21st century. Various theoretical advancements—like those in message passing (sum-product and max-product algorithms), sampling-based inference (Gibbs sampling, Metropolis-Hastings), variational methods, and structural learning—have been widely adopted. Furthermore, specialized hardware and the advent of deep probabilistic programming frameworks have spurred new applications and research directions, ensuring that graphical models remain a central topic in modern AI research and practice.

1.3 relationship to other machine learning paradigms

Graphical models often get compared to other well-known paradigms, such as neural networks or decision trees. Typically, neural networks are understood as parameterized, differentiable functions that learn complex mappings from inputs to outputs through gradient-based optimization. On the surface, they differ significantly from graphical models, which emphasize explicit factorization of joint probability distributions and direct representation of conditional independencies. However, recent developments in deep generative models—like variational autoencoders—have begun to bridge the gap by incorporating graphical model concepts within layered neural architectures. Researchers have explored hybrid approaches that combine the interpretability of graphical structures with the feature-extraction power of deep networks (for instance, through neural message passing in factor graphs).

Decision trees, on the other hand, learn hierarchical splits in the feature space and can be seen as a form of directed acyclic graph. However, they usually do not adopt the explicit semantics of conditional independence that define Bayesian networks or Markov random fields. Ensembles of such trees—even in random forests or gradient boosting—usually do not express a single coherent joint distribution but instead average or boost multiple predictive rules.

Compared with purely discriminative methods (like linear classifiers or logistic regression), graphical models are more general because they capture joint or conditional distributions in a structured manner. This allows for tasks like missing data imputation, latent variable inference, or structured prediction in ways that purely discriminative methods do not natively support. Indeed, structured prediction tasks—such as labeling sequences, segmenting images, or modeling relational data—are typically best handled by frameworks rooted in the general formalism of graphical models.

An additional advantage is that graphical models can incorporate prior knowledge about a domain's structure. For instance, if domain experts indicate that certain variables are conditionally independent given others, we can encode that directly in the network's topology. This transparency can be essential in high-stakes applications like medicine, where interpretability and a rigorous accounting of uncertainty are paramount.

1.4 notation and terminology

In the realm of graphical models, one typically designates a set of random variables $\{X_1, X_2, \ldots, X_n\}$ to describe a system of interest. The graphical model is represented by a graph $G$ with nodes corresponding to these random variables. Edges (directed or undirected) convey probabilistic dependencies. When employing a directed graph—like in a Bayesian network—edges point from parent nodes to child nodes, reflecting the factorization $p(X_1, \ldots, X_n) = \prod_{i=1}^n p(X_i \mid \text{Pa}(X_i))$ , where $\text{Pa}(X_i)$ denotes the set of parents of $X_i$ in the graph. For undirected models—like Markov random fields—the joint distribution often takes the form:

p(\mathbf{X}) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \phi_C(\mathbf{X}_C),

where $\mathcal{C}$ is the set of cliques or maximal cliques, and $\phi_C$ is a potential function over the variables in each clique $C$ . The term $Z$ is the normalizing constant (also known as the partition function), ensuring the distribution sums or integrates to 1.

We often use the phrase "factor graph" to indicate a bipartite representation that separates variable nodes and factor nodes (i.e., potential or conditional distribution functions). This approach highlights how the joint distribution factorizes into smaller pieces (factors) that each involve a subset of variables.

Common terminology includes:

Nodes: They represent random variables or sets of variables.
Edges: They connect nodes that directly depend on each other in some manner.
Potential functions: In the undirected case, real-valued mappings that define how strongly certain variable configurations are weighted.
Conditional probability distributions: In directed graphs, each node's distribution given its parents.
Clique: A fully connected subset of nodes in an undirected graph.
Plate notation: A shorthand representation used to concisely denote repeated structures.

Throughout this article, I will alternate between these representations, always indicating which graph type (directed, undirected, or factor graph) is being used.

2. transition from bayesian networks

2.1 revisiting bayesian networks

Bayesian networks (BNs) are perhaps the most widely recognized example of directed graphical models. They encode the joint distribution of a collection of variables $X_1, \ldots, X_n$ through a product of simpler conditional probability distributions:

p(X_1, \ldots, X_n) = \prod_{i=1}^{n} p\bigl(X_i \mid \text{Pa}(X_i)\bigr),

where $\text{Pa}(X_i)$ denotes the parent nodes of $X_i$ in the directed acyclic graph (DAG). Such factorization typically allows us to avoid enumerating exponentially large joint tables.

A BN's structure can conveniently represent a domain with natural causal or hierarchical relationships. For example, in a simple medical diagnosis network, "Disease" might be a parent of "Symptoms," capturing that each Symptom node is conditionally dependent on the presence or absence of the Disease. The graph remains acyclic, which is crucial because cycles can lead to well-defined but more intricate forms of representation that traditional BN inference algorithms cannot directly handle.

Bayesian networks lend themselves to a variety of inference algorithms, such as variable elimination or the junction tree algorithm, that exploit local conditional independences. If the network is sparse (few edges), inference can be efficient. However, real-world problems can be quite dense or require cross-loops, in which case a BN might become unwieldy.

2.2 generalization to graphical models

Although Bayesian networks are powerful, they are essentially a subset of the general umbrella of graphical models. Graphical models can be directed, undirected, or hybrids. Undirected graphical models (e.g., Markov random fields) can be easier to interpret in contexts where symmetrical relationships exist among variables (like in certain spatial models). Furthermore, advanced structures like factor graphs unify both directed and undirected representations in a bipartite manner. In these factor graphs, the fundamental building blocks are factors—each factor might correspond to a local conditional probability distribution (directed) or a clique potential (undirected).

This broad perspective addresses a wide range of real-world situations. Indeed, many problems do not follow a natural directed flow; instead, they revolve around mutual constraints or symmetrical interactions among variables. In those scenarios, Markov random fields and factor graphs become more natural frameworks. The notion of "messages" that pass between variables and factors is a hallmark of such generalizations, forming the basis of widely used algorithms like belief propagation.

2.3 limitations of bayesian networks

While Bayesian networks are extremely useful, they come with limitations:

Directed edges impose a partial ordering: One must define a DAG, so each variable is assigned a direction relative to other variables. This can be unnatural if there is no inherent causal or directional structure in the domain.
Cyclic dependencies are not captured: If certain variables interact in feedback loops (e.g., in statistical physics or certain system dynamics), the DAG framework will not capture those cycles without additional design.
Potentially large parameter sets: For many discrete Bayesian networks, each child node might have a large conditional probability table that depends on all combinations of parent variables, which can become vast even with moderate numbers of parents.
Inference complexity: As the network grows in complexity or connectivity, inference might become computationally challenging. Approximate methods exist, but they can be slow to converge or fail to provide tight guarantees in certain cases.

2.4 motivations for expanding beyond bayesian networks

Given the limitations above, researchers were motivated to consider frameworks that could model more general sets of interactions, symmetrical constraints, or clique-based factorizations. Markov random fields embody one such approach, removing directional edges and focusing on how variables might form cliques within an undirected network. Meanwhile, factor graphs further unify these paradigms by representing the joint distribution in terms of local factors—functions that constrain subsets of variables—and variable nodes that collectively define the entire distribution.

These expansions also tackled real-world challenges in computer vision, where one frequently encounters smoothness constraints—neighboring pixels or regions of an image are likely to have similar properties. An undirected model elegantly encodes these "neighboring constraints," capturing the local correlation among pixel labels for segmentation or denoising tasks. Indeed, Markov random fields soon became a mainstay in that domain, together with specialized inference algorithms like graph cuts and belief propagation.

3. d-separation

3.1 concept of d-separation

In a directed graphical model such as a Bayesian network, determining when two variables are conditionally independent is essential for both theoretical and computational reasons. The d-separation criterion is a set of rules introduced by Judea Pearl for systematically ascertaining conditional independencies specified by a DAG. In informal terms, d-separation helps us follow all possible paths between two variables $X$ and $Y$ to see whether conditioning on a third set of variables "blocks" or "unblocks" those paths.

A path in the DAG is said to be blocked if it contains a chain or fork structure $A \to B \to C$ or $A \leftarrow B \to C$ in which the middle node $B$ is conditioned on, or a collider structure $A \to B \leftarrow C$ where $B$ is not conditioned on and none of its descendants are conditioned on either. If all paths from $X$ to $Y$ are blocked, then $X$ and $Y$ are d-separated, implying $X \perp Y \mid \text{Condition}$ . If at least one path is open (unblocked), $X$ and $Y$ remain dependent given the conditioning set.

3.2 algorithm for d-separation

To apply d-separation algorithmically, one typically follows this procedure:

Identify all paths between the two target variables, $X$ and $Y$ .
For each path, categorize each edge relationship as a chain ( $\to \to$ ), a fork ( $\leftarrow \to$ ), or a collider ( $\to \leftarrow$ ).
Determine whether conditioning on your set of variables "blocks" or "unblocks" each path.
If every path is blocked, $X$ and $Y$ are d-separated and hence conditionally independent. If any path remains unblocked, they are dependent.

This might sound cumbersome, but efficient algorithms can be employed, especially for large networks, using graph-theoretical techniques that systematically check for blocked paths. The real power of d-separation is conceptual: it clarifies how a Bayesian network encodes a great many conditional independencies via the structure of the DAG, which in turn shapes the inference and factorization properties of the model.

3.3 implications for inference

Once we know that two variables are d-separated given some evidence, we also know that they cannot influence each other's probability distribution once that evidence is observed. This drastically simplifies the model, allowing inference algorithms to ignore the possibility of additional dependencies. The computational efficiency of methods like variable elimination or message passing is directly tied to these conditional independence relationships, which effectively reduce the dimension of integration or summation needed to compute posterior distributions.

In large networks, d-separation can be used as a structural tool to detect whether certain manipulations—such as interventions or the addition of edges—affect the independence structure. This is crucial when building or refining a Bayesian network based on domain knowledge, ensuring the final DAG encodes all desired (in)dependencies. In more advanced contexts, such as causal inference, d-separation forms the foundation for algorithms that compute the effect of interventions or check identifiability under different assumptions.

3.4 practical applications of d-separation

D-separation extends well beyond toy Bayesian networks. In real-world settings, it can determine how to group or cluster variables for parallelizable inference in large-scale systems. For instance, if a subset of variables is isolated from another subset (d-separated by some evidence), one can break the global inference problem into smaller tasks.

In practice, software libraries like PyMC, Pyro, and probabilistic programming frameworks often implement data structures to store the graph and reason about dependencies. When you create or modify a Bayesian network in these libraries, they implicitly keep track of which nodes are conditionally independent, potentially optimizing sampling or exact inference routines. This sort of capability is indispensable when dealing with large hierarchical models in fields like ecological modeling or medical diagnosis (e.g., learning from large-scale patient data with many correlated factors).

4. markov random fields

4.1 introduction to markov random fields (MRFs)

A Markov random field (MRF) is an undirected graphical model where edges indicate potential dependencies between variables. Formally, an MRF is defined over an undirected graph $G = (V, E)$ with a set of random variables $X_v$ for each vertex $v \in V$ . Edges $e \in E$ connect variables that share a direct dependency. The Hammersley-Clifford theorem states that any strictly positive distribution over $X$ that satisfies the Markov property (i.e., each variable is conditionally independent of the rest given its neighbors) can be factorized in terms of cliques of the graph:

p(\mathbf{X}) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \phi_C(\mathbf{X}_C),

where $\mathbf{X}$ is the full set of variables, $\mathcal{C}$ is the set of all cliques (or, in some versions, maximal cliques), $\phi_C$ is the potential function over the variables in clique $C$ , and $Z$ is the partition function $Z = \sum_{\mathbf{X}} \prod_{C \in \mathcal{C}} \phi_C(\mathbf{X}_C)$ .

4.2 structure and properties of MRFs

The key property of an MRF is that each node only depends directly on its neighbors in the undirected graph. The local Markov property states that $X_v$ is conditionally independent of all other variables given its immediate neighbors. This makes sense in many applications: for instance, in image modeling, each pixel is expected to depend only on its surrounding neighbors, expressing local smoothness assumptions. Text or language models leveraging adjacency constraints might rely on local interactions and more complicated global constraints for coherence.

MRFs can model phenomena without imposing any directional notion. Symmetry is often an appealing feature, such as in grid-like structures used for image segmentation (where each pixel is connected to its immediate neighbors). Because they do not specify a direction, they naturally capture cyclical or mutual dependencies that are awkward or impossible to encode in directed acyclic graphs.

4.3 relationship to bayesian networks

Bayesian networks and MRFs differ primarily in how they represent and factorize the joint distribution. BNs factorize according to topological ordering in a DAG, while MRFs factorize via cliques in an undirected graph. Sometimes, a given joint distribution can be represented by either a BN or an MRF, though the resulting graphs may appear structurally different. In other circumstances, certain distributions factorize more readily in a directed format, while others are simpler in an undirected format.

A related concept is the moralization of a Bayesian network, in which you "undirect" the edges and connect all parents of a child node. Moralizing a BN yields an undirected graph to which you can apply standard MRF-based inference if that helps. This process illustrates a close connection between these two families.

4.4 applications of MRFs

MRFs are widely used in spatial and temporal models. Two prime examples involve:

Computer vision and image segmentation: For tasks such as image denoising, one might define a pixel-lattice MRF that encodes the assumption that adjacent pixels should have high probability of sharing similar intensity or color values. The potentials $\phi_C$ quantify how certain pixel configurations are "preferred" or "disallowed."
Spatial statistics: Geostatistics applications might define MRFs over regions to capture correlations in geographical data, such as rainfall or vegetation measurements in adjacent areas.

Other well-known use cases include natural language processing (e.g., part-of-speech tagging, where each word's tag depends on neighboring tags), building probabilistic models for random fields in physics, and social network analysis (where each node might represent an actor or agent with ties to neighbors indicating a tendency to align behaviors).

5. relation to directed graphs

5.1 directed versus undirected graphs

When deciding whether to use a directed or undirected representation, the nature of the domain and the conceptual interpretation typically guide the choice. In a directed graph, an edge $X \to Y$ indicates that $Y$ is conditionally dependent on $X$ given the parents of $Y$ . The ordering (no cycles) might match real-world causal or time-based structures. Undirected graphs, on the other hand, do not encode direction, making them suitable for representing symmetrical relationships.

5.2 representational power of directed graphs

Directed models excel at modeling causal or generative processes, where one can read the graph from ancestors to descendants. For example, a hidden Markov model is a directed chain that unfolds over time. Each state's distribution depends on the previous state's state variable, and the emission depends on the current state. This sequential modeling is straightforward in a directed representation, naturally aligning with an unrolling in time.

5.3 representational power of undirected graphs

Undirected models are more natural for representing constraints that do not have an inherent direction, such as pairwise constraints in a grid or in a general adjacency structure. In image segmentation, there is no "direction" of correlation among neighboring pixels; rather, the pairwise potentials reflect local consistency. Similarly, in robust statistics or physics-based models, symmetrical potential functions might reflect energy states that do not care about the ordering of variables.

5.4 hybrid models

Hybrid graphical models, such as factor graphs, unify both directed and undirected aspects. In factor graphs, we have variable nodes and factor nodes, and edges connect each factor to the variables it touches. These factors can be conditional probability distributions (in a directed sense) or potential functions (in an undirected sense). Hybrid approaches allow for domains in which certain portions have a natural directionality (like a subset of variables that follow a causal chain), whereas other portions revolve around symmetrical relationships. Such flexibility has proven valuable in advanced applications, from error-correcting codes to sophisticated hierarchical Bayesian structures that incorporate random fields as part of a larger directed scheme.

6. inference in graphical models

6.1 overview of inference

Inference in graphical models refers to the process of answering queries about random variables given evidence. For example, we might want the probability $P(X \mid E)$ for some subset of variables $X$ given observed evidence $E$ . Additionally, we might be interested in the most likely configuration of unobserved variables given the data (the MAP estimate). Whatever the query, conducting inference in a graphical model can be approached through several algorithmic paradigms, including exact inference and approximate inference.

6.2 exact inference

Variable elimination

Variable elimination is an exact inference procedure that systematically sums (or integrates) out variables in a chosen order. By grouping factors and eliminating variables one by one, it reduces the computation compared to a naive approach. However, the method's complexity can still be large if the graphical model has many interdependencies, forcing repeated multiplications of large factors. Nevertheless, for tree-structured or sparse graphs, variable elimination can be tractable and quite efficient.

Belief propagation

Belief propagation (BP), especially in tree-structured graphs, is another exact inference method. In a tree or a polytree (a singly connected graph), BP passes "messages" along edges, updating beliefs about each node. The sum-product algorithm is a form of BP that computes marginal distributions, while the max-product algorithm computes the most probable joint assignment. On trees, it converges to the exact solution, whereas on graphs with cycles, it becomes loopy belief propagation—an approximate method we will discuss later.

6.3 approximate inference

When the graph is complex—containing many loops or large clique sizes—exact inference can be computationally intractable (NP-hard in general). Hence, approximate methods come into play:

Sampling-based techniques: Monte Carlo methods, such as Gibbs sampling, Metropolis-Hastings, or particle filtering, draw samples from the posterior distribution, approximating marginals or expectations with empirical averages.
Variational methods: These transform the inference problem into an optimization one, approximating the true distribution with a simpler family (like a mean-field approach).
Loopy belief propagation: Applies message passing in graphs with cycles, typically as a heuristic that often converges to a reasonable solution, although it may fail to converge in certain cases.

6.4 inference algorithms

In high dimensions, choosing an appropriate inference algorithm depends on the graph structure, the nature of the factors, and computational constraints. The junction tree algorithm is a prime strategy that effectively transforms the original graph into a tree of cliques (a clique tree), where belief propagation can be done exactly if the tree is not too large. Conversely, if the resulting clique tree has extremely large cliques, the method becomes infeasible, leading to approximate alternatives.

Many advanced applications (e.g., large-scale Bayesian networks or MRFs for computer vision) rely on specialized approximate algorithms—graph cuts, alpha-expansion, or mean-field approximations—to handle the combinatorial explosion. The choice often depends both on theoretical performance guarantees and empirical results in a given domain.

7. inference on a chain

7.1 graphical model structure of a chain

Chain-structured graphical models provide a simpler special case where the nodes $X_1, X_2, \ldots, X_n$ form a chain, and each node is connected only to its immediate predecessor and successor (if they exist). This structure arises prominently in time-series data, where $X_t$ depends on $X_{t-1}$ and influences $X_{t+1}$ . Hidden Markov models (HMMs) are a well-known example, with a chain of hidden states each yielding observations.

7.2 exact inference on chains

Chained graphs permit efficient dynamic programming approaches, such as the forward-backward algorithm in HMMs. Because the chain has no branches, we can compute $P(X_t \mid \text{evidence})$ in linear time with respect to $n$ . The technique systematically propagates messages forward (the forward probabilities) and backward (the backward probabilities), combining them to produce posterior distributions of intermediate states.

7.3 hidden Markov models (HMMs)

In an HMM, each observed variable $Y_t$ depends only on the corresponding hidden state $X_t$ , and each hidden state depends only on the previous hidden state, forming a directed chain. The distribution factorizes as:

p(X_1, \ldots, X_n, Y_1, \ldots, Y_n) = p(X_1)\prod_{t=2}^n p(X_t \mid X_{t-1}) \prod_{t=1}^n p(Y_t \mid X_t).

HMMs find application in speech recognition, part-of-speech tagging, and various time-series modeling tasks. Because of their chain structure, standard inference is tractable with dynamic programming (forward-backward), and learning the parameters can be done via the Baum-Welch (EM) algorithm.

7.4 applications of chain-based inference

Besides HMMs, chain-based inference covers simpler linear chain conditional random fields (CRFs) for sequence labeling, or Markov models for stochastic processes. In NLP, for example, a chain-structured CRF might label each token in a sentence (e.g., as a named entity or not) in a manner that properly accounts for adjacent context. Similarly, in computational biology, chain structures appear in sequence alignment or protein secondary structure prediction when each residue is influenced by its neighbor.

8. factor graphs

8.1 definition of factor graphs

A factor graph is a bipartite graph that explicitly separates variables from the factors (or functions) that depend on them. On one side of the bipartite graph, we have nodes representing the variables $X_1, \ldots, X_n$ . On the other side, we have factor nodes $f_1, \ldots, f_m$ , where each factor $f_i$ is a function of a subset of variables. The model's joint distribution factorizes as:

p(\mathbf{X}) = \frac{1}{Z} \prod_{i=1}^m f_i(\mathbf{X}_{\text{scope}(f_i)}),

where $\text{scope}(f_i)$ denotes the set of variables on which factor $f_i$ depends, and $Z$ is a normalization constant.

8.2 relationship to other graphical models

Factor graphs can represent both Bayesian networks and Markov random fields by breaking down their factorization into smaller functional pieces. For a Bayesian network, each conditional probability distribution $p(X_i \mid \text{Pa}(X_i))$ can be seen as a factor. For an MRF, each clique potential is seen as a factor. This bipartite representation often simplifies the implementation of message-passing algorithms like belief propagation or the sum-product algorithm.

8.3 message passing in factor graphs

The sum-product algorithm is the quintessential method for inference in factor graphs. Each variable node maintains a belief about its marginal distribution, while each factor node maintains messages that encode how strongly that factor favors different values of the variables. In a tree-structured factor graph, sum-product yields exact marginals. If the graph has cycles, the same algorithm may be applied as loopy belief propagation, which yields approximate solutions.

8.4 applications of factor graphs

One of the earliest and most notable applications of factor graphs was in error-correcting codes, particularly the sum-product decoding algorithm for low-density parity-check (LDPC) and turbo codes. Factor graphs are also widely used in advanced machine learning scenarios, such as structured prediction in CRFs or in deep probabilistic architectures that maintain a factorization of certain layers. Their modular nature makes them attractive for building large, compositional models in a bottom-up manner.

9. exact inference in general graphs

9.1 challenges in general graphs

When a graphical model is not tree-structured (i.e., it forms cycles), exact inference can become much more complex. Traditional belief propagation is no longer guaranteed to be exact, as messages might get passed around loops indefinitely. The clique factorization might involve large cliques that inflate the computational complexity exponentially.

9.2 junction tree algorithm

The junction tree algorithm is a key solution that transforms the original graph into a "junction tree," also called a "clique tree," in which each node of the tree is a maximal clique from the original graph. By constructing this tree appropriately (through a procedure called graph triangulation), one can run a version of sum-product message passing on the tree of cliques. This yields exact marginals if the model is fully captured by these cliques.

Formally, if $\{C_1, C_2, \ldots, C_k\}$ are the maximal cliques in your triangulated graph, the junction tree organizes them so that for any pair of cliques $C_i$ and $C_j$ that share variables $S = C_i \cap C_j$ (called a separator), all the cliques on the path between $C_i$ and $C_j$ in the junction tree also contain $S$ . This property is crucial for proper message passing and to ensure consistency of shared variables across the cliques.

9.3 variable elimination

Variable elimination in a general graph might also be used but is sensitive to the order in which variables are eliminated. The optimal elimination order can be as challenging to find as the inference problem itself, although heuristics (like min-fill or min-degree) can produce orderings that lead to smaller intermediate factors.

The complexity is often governed by the "treewidth" of the graph—loosely, how large the largest clique in the best possible triangulation is. If the treewidth is small, exact inference can be tractable; if the treewidth is large, we are forced to approximate.

9.4 clique tree approach

Conceptually, one can interpret the junction tree algorithm as building a clique tree, then running a specialized sum-product (or max-product) message passing on that tree. On each iteration, cliques send messages to adjacent cliques by marginalizing out variables not in the receiver clique. After enough message passes, each clique's belief stabilizes, and from these, we can extract exact marginal distributions for any subset of variables. This approach is at the heart of many advanced inference packages that handle complex, general graphs.

10. loopy belief propagation

10.1 loopy belief propagation (LBP) overview

Loopy belief propagation (LBP) is an adaptation of belief propagation to graphs with cycles. Instead of being guaranteed to converge to the correct marginals (as in a tree), it might enter a cycle of messages and fail to settle. In some cases, it does converge, but the result is not guaranteed to be the true marginal distribution. Despite these caveats, LBP is still widely used in practice due to its simplicity and often surprisingly good empirical performance.

10.2 convergence issues in LBP

The difficulties in LBP typically arise from strongly connected loops or contradictory constraints between nodes. While LBP can converge even in the presence of loops—sometimes very quickly—there is no universal guarantee. Researchers have proposed numerous modifications, such as damping (slowing down updates to avoid oscillations) and scheduling heuristics (using asynchronous updates or prioritizing certain messages), to encourage or force convergence.

10.3 applications of LBP

LBP powers many approximate inference tasks in computer vision, particularly stereo vision, image restoration, and segmentation problems, where the underlying graph is cyclical. It also finds application in coding theory (where the sum-product decoding of LDPC codes can be seen as LBP on a factor graph) and in large-scale Bayesian networks with loops. In many real situations, exact inference is computationally prohibitive, so an approximate method with an easy-to-implement message-passing routine is a practical alternative.

10.4 comparison with other approximate inference methods

Compared to variational methods, LBP is often more straightforward to implement and can handle discrete variables with complicated interactions. Variational approaches typically require designing a tractable family of approximate distributions and deriving an objective function to optimize. Monte Carlo sampling might be more general and flexible, but it demands many samples for convergence, and can be slow or suffer from mixing issues in high-dimensional spaces. Ultimately, the choice of approximate inference method often depends on the nature of the problem, the required accuracy, and computational constraints.

11. learning the graph structure

11.1 structure learning problem

While so far we have assumed that the structure of the graph is known, in practice, one might be faced with the challenge of discovering the underlying dependencies among variables from data. This is the structure learning problem: given a dataset $D$ of observations of random variables $X_1, \ldots, X_n$ , we want to learn both the parameters (potentials or conditional probability distributions) and the structure (edges, cliques, or factor scopes) of the graph.

11.2 approaches to structure learning

Broadly, structure learning approaches fall into two categories: score-based and constraint-based. Score-based methods define an objective function (like the Bayesian Information Criterion, BIC, or Akaike Information Criterion, AIC) that measures how well a candidate graph explains the data, penalizing complexity. One then searches over the space of possible graphs—often with heuristics like greedy hill climbing or local search—to find a structure with the best score.

Constraint-based approaches use conditional independence tests to determine which edges are necessary or redundant. Variables that are found independent given some subset of other variables should not be directly connected in the graph. The classic PC algorithm is an example, systematically applying independence tests and removing edges that do not pass them.

11.3 score-based learning

In a Bayesian network setting, a typical score might be the log-likelihood of the data plus a regularization penalty that grows with the number of parameters. This approach is intuitive because it balances how well the model fits the data with how complex or "expensive" the model is. Approaches like the Bayesian Dirichlet equivalent (BDe) score incorporate priors over network structures. Score-based methods can be extended to undirected models as well, though the complexity of computing the partition function $Z$ can make it challenging.

11.4 constraint-based learning

Constraint-based algorithms like the PC algorithm start by assuming a fully connected graph, then systematically test conditional independences among variables. If a pair of variables is independent given some subset of others, the algorithm removes the edge between them. The final adjacency structure is refined until all tested independences are satisfied. For large graphs, the number of potential independence tests can be significant, so heuristics and domain knowledge accelerate the process.

11.5 hybrid learning approaches

Hybrid methods combine score-based and constraint-based ideas, e.g., by first using constraints to prune the space of possible structures and then scoring the reduced set. Alternatively, one could use a constraint-based method to propose candidate edges and then apply a score-based method to refine or confirm them. These multi-stage approaches often strike a balance between computational feasibility and statistical fidelity, especially for large datasets.

11.6 applications of structure learning

Structure learning arises in a variety of fields:

Genetics: Genes and phenotypes are typically represented as nodes; edges characterize regulatory interactions. Learning the structure can help identify regulatory pathways or epistatic interactions.
Social networks: Inferring who influences whom in a social group might be cast as a structure-learning problem, especially in observational data.
Recommendation systems: Potentially, one can learn how user preferences and item characteristics interrelate, forming a factor graph or BN that captures both shared user tastes and item similarities.

12. advanced topics and current research

12.1 dynamic graphical models

Many real-world processes evolve over time, motivating dynamic extensions of both Bayesian networks and Markov random fields. Dynamic Bayesian networks explicitly model temporal transitions with time-slice to time-slice edges, often providing a generalization of hidden Markov models. In dynamic MRFs, one might have an undirected structure that shifts or grows over time, with new nodes (or factors) added as more data arrives.

12.2 nonparametric graphical models

Nonparametric methods free the model from the assumption that potentials or conditional distributions follow a certain parametric form (like a Gaussian or discrete table). Kernel-based potentials or Gaussian processes can be integrated into a graphical framework, though inference quickly becomes more complicated. For instance, one can define a Gaussian process prior over a function that forms part of a factor in the graph, leading to sophisticated but computationally heavy solutions.

12.3 deep learning and graphical models

Recent cutting-edge research blends deep learning with graphical-model concepts. Variational autoencoders and graphical flow-based models can sometimes be seen as factorized distributions with neural network components modeling certain conditional or potential functions. Meanwhile, graph neural networks (GNNs) harness ideas from message passing—akin to belief propagation—on arbitrary graphs. There has been a wave of interest (NeurIPS, ICML, JMLR) in using GNNs as "learnable" message-passing algorithms. These approaches often incorporate the interpretability and factor-structured design of graphical models with the representational capacity of deep networks.

12.4 graphical models in causal inference

Causal graphical models extend the Bayesian network formalism by emphasizing interventions and do-calculus. This approach, pioneered by researchers like Pearl, attempts to clarify how changing one variable (through an external intervention) affects others. By encoding assumptions about causal directions, researchers can differentiate correlation from causation. The combination of structure learning with causal discovery is extremely active: recent developments investigate how to learn directed acyclic graphs that reflect genuine causal linkages, opening the door to robust decision-making and policy analysis in fields like economics or epidemiology.

12.5 the future of graphical models

Graphical models remain an essential tool, with a vibrant research community. One sees progress in areas such as:

Scalable inference: Exploiting GPUs or specialized hardware to accelerate sampling or message passing.
Structured deep learning: Coupling graphical priors with flexible neural components, producing interpretable yet powerful architectures.
Online and streaming inference: Updating beliefs incrementally as new data arrives in real time.
Integrating symbolic and statistical AI: Melding structured knowledge representation with probabilistic reasoning in a single framework.

The widespread adoption of big data analytics, real-time systems, and sophisticated machine learning pipelines ensures that novel theoretical advances in graphical models will continue to find immediate application. Whether in error-correction coding, robotics, industrial process control, or natural language understanding, the fundamentals of graphical models—decomposing complex distributions based on structured graphs—promise to remain deeply relevant.

An image was requested, but the frog was found.

Alt: "An abstract representation of a Markov random field, showing nodes and undirected edges in a grid structure"

Caption: "Markov random fields often appear in grid-like forms for image segmentation tasks."

Error type: missing path

An image was requested, but the frog was found.

Alt: "A bipartite factor graph with multiple factor nodes and variable nodes"

Caption: "Factor graphs make it explicit which subsets of variables each factor depends on, facilitating message-passing algorithms."

Error type: missing path

Below is a short illustrative code snippet in Python for building and performing inference on a simple factor-graph-like structure using a probabilistic framework such as Pyro. This example is simplified to demonstrate the concept of factorization:

<Code text={`
import pyro
import pyro.distributions as dist
from pyro.infer import Importance, EmpiricalMarginal
from pyro.infer.mcmc import MCMC, NUTS
import torch

def model():
    # Suppose we have two factors f1(X1, X2) and f2(X2, X3)
    x1 = pyro.sample("x1", dist.Normal(0., 1.))  # prior
    # Factor f1 can be thought of as p(x2 | x1).
    # For example, let's define it as Normal(x1, 1.)
    x2 = pyro.sample("x2", dist.Normal(x1, 1.))
    # Factor f2 is p(x3 | x2), also Normal
    x3 = pyro.sample("x3", dist.Normal(x2, 1.))
    return x1, x2, x3

# Let's do a quick MCMC-based inference
nuts_kernel = NUTS(model)
mcmc = MCMC(nuts_kernel, num_samples=2000, warmup_steps=500)
mcmc.run()

samples = mcmc.get_samples()
print("Mean of x1:", torch.mean(samples["x1"]))
print("Mean of x2:", torch.mean(samples["x2"]))
print("Mean of x3:", torch.mean(samples["x3"]))
`}/>

While this snippet does not explicitly show edges or factor nodes, it demonstrates how a probabilistic programming approach factors the joint distribution into multiple pieces (one factor per sample statement). Librarie such as Pyro, PyMC, or Stan effectively leverage the underlying ideas of factorization and conditional dependence to perform inference or sampling.

In conclusion, graphical models deftly encode intricate relationships in high-dimensional data and provide principled, probabilistic frameworks to reason under uncertainty. From classic Bayesian networks to modern neural message-passing, the fundamental notions of factorization, conditional independence, and graph structure remain indispensable in advanced data science and machine learning pipelines. These concepts empower one to design interpretable, flexible, and computationally amenable models that scale to the demands of real-world applications, while still preserving a solid theoretical core grounded in probability theory.

Averett's Heuristics@avheuristics

Subscribe to my Telegram channel for updates in the Research section and more tech content