MoE architecture
Actually a very smart architecture
#️⃣   ⌛  ~1.5 h 🤓  Intermediate
18.03.2025
#155



🎓 116/167

This post is a part of the Specialized & advanced architectures educational series from my free course. Please keep in mind that the correct sequence of posts is outlined on the course page; the order in Research can be arbitrary.

I'm also happy to announce that I've started working on standalone paid courses, so you can support my work and get affordable educational material. These courses will be of a completely different quality, with more theoretical depth and a niche focus, and will feature challenging projects, quizzes, exercises, video lectures and supplementary material. Stay tuned!


The notion of leveraging multiple specialized models — sometimes referred to as "experts" — to solve complex tasks is not entirely new in the domain of artificial intelligence. In fact, the first glimpses of such ideas trace back to early attempts in the 1980s and 1990s to build learning architectures that could each address a simpler portion of the problem space. In these formative years, researchers were motivated by the idea that a single monolithic model, no matter how large, might not efficiently capture the intricacies and variability often present in real-world data. A more fruitful approach, they speculated, would involve dividing the problem into smaller, more homogeneous chunks and then learning a specialized function for each part of the domain.

Historically, one can see parallels between mixture of experts and the older concept of ensemble methods, such as bagging or boosting, though the latter typically revolve around training multiple learners and aggregating their predictions for improved accuracy. MoE (mixture of experts), however, is distinct in that there is typically a gating network that decides, for each input, which expert (or experts) should be responsible. Early works by Jordan and Jacobs (Neural Computation, 1991), Nowlan and Hinton (NIPS, 1990), and others laid essential theoretical foundations. They introduced gating functions and adaptive mixtures of local experts, where each expert would produce a local or partial solution, and the gating function would learn to weigh or select these experts depending on the input characteristics.

historical context and early research

In the early 1990s, researchers like Michael I. Jordan, Robert A. Jacobs, and Geoffrey Hinton were exploring ways to train systems that harness the power of specialized subnetworks. This period saw the introduction of hierarchical mixture of experts, as well as the adaptive mixtures of local experts approach. These approaches were heavily inspired by the concept of a "divide and conquer" strategy, where complicated data distributions are broken into simpler sub-distributions. Each sub-distribution is then tackled by a single expert that can better model it.

A key historical milestone was the realization that gating networks could effectively direct inputs to experts, facilitating specialization without requiring every sub-model to handle the entire complexity of the task. Another major leap occurred when researchers recognized the synergy between mixture of experts ideas and well-known statistical models like Gaussian mixture models. Indeed, many proofs and training methods for MoE paralleled the standard EM (expectation-maximization) framework used for mixture models.

inspiration from divide-and-conquer strategies

One way to understand mixture of experts is through the lens of divide-and-conquer strategies in algorithm design and problem-solving. Consider how one might handle a large, unwieldy problem in software engineering: it is almost always broken down into smaller modules, each focusing on a particular functionality. By the same token, MoE tries to break the input distribution into narrower, more manageable subspaces, effectively assigning each subspace to an expert that is highly competent in that slice.

For instance, in speech recognition tasks dating from the 1990s, one might have an expert for female speakers of a specific accent, another expert for male speakers of a different accent, and so on. In this way, each sub-model only has to master a smaller sub-problem, which can lead to better generalization and faster training times. That principle has persisted, and the contemporary large-scale mixture of experts models — in both natural language processing and computer vision — are expansions of the same basic notion.

why mixture of experts?

Modern machine learning has led to a proliferation of deep neural networks with billions or even trillions of parameters. While these large models tend to perform well across numerous tasks, they face practical issues related to training time, memory constraints, inference latency, and sometimes diminishing returns in terms of overall performance improvements.

Mixture of experts offers a compelling alternative: it introduces conditional computation, meaning that for a given input, only a small subset of the model's parameters (the relevant experts) are activated. This lowers the overall computational cost at inference time if the gating is designed to select only a subset of the experts.

Hence, the main motivation for MoE is twofold:

  1. Increased model capacity vs. computation trade-offs. Instead of applying a full, massive model to every input, you effectively let a gating mechanism pick the relevant, specialized sub-models.
  2. Handling diverse subproblems. Real-world datasets often contain numerous modes or regions that differ widely. By dividing these distinct subproblems among experts, each piece of data is handled by a sub-network that is particularly adept at that kind of input.

addressing model capacity vs. computation trade-offs

A single neural network with billions of parameters requires enormous compute resources to train, serve, and maintain. Even if the data demands such capacity, it might be inefficient to use all parameters for every sample. MoE addresses this limitation by "turning on" only the relevant experts. This dynamic can yield a network with a far larger overall parameter budget — since each expert can be large, and there can be many experts — while keeping the compute cost for any single input significantly smaller.

In large-scale language modeling, for instance, approaches like Switch Transformers highlight this advantage clearly. Despite having tens or hundreds of billions of parameters in total, each token or example sees only a fraction of that parameter count during forward propagation. This synergy between large capacity and comparatively modest per-sample computation remains a primary driver of modern MoE architectures.

handling diverse subproblems through specialized experts

Furthermore, in many domains, data is inherently diverse. Take a multilingual language model for example: each language or dialect might embody a unique subproblem, complete with its own lexical distributions and syntactic patterns. An MoE approach allows certain experts to specialize in one or more languages, building deeper competence within that domain. Meanwhile, the gating network identifies which language is being processed and ensures the appropriate experts handle the input.

Similarly, in image recognition tasks, different experts might specialize in distinct sets of objects, scale ranges, or image styles. The impetus behind MoE is always the same: leverage specialized sub-models to capture the nuances of different parts of the input space, rather than forcing a single model to handle it all in a monolithic way.

high-level idea

Conceptually, a mixture of experts architecture comprises:

  1. Expert Networks. These are the specialized sub-models that produce their own outputs for a given input.
  2. A Gating Network. This network takes an input and outputs weights (or discrete selections) that indicate which experts should be consulted and how heavily each should contribute.
  3. Combining Mechanism. The system then aggregates the results from the chosen experts (e.g., by a weighted summation of their outputs or by selecting the single best expert).

This stands in contrast to standard ensemble methods (such as a random forest or a simple bag of classifiers), which typically lack a dynamic gating mechanism to route inputs. Instead, those methods combine the outputs of all learners in the same manner regardless of the input.

2. conceptual foundations

2.1. the expert networks

An "expert" in an MoE architecture is typically a learnable function specialized to capture certain statistical patterns in the data. In classical formulations, these experts might be small feedforward networks or linear models. In more recent deep learning contexts, experts can be:

  • Multilayer Perceptrons (MLPs). For simpler tasks or subproblems that do not require extensive feature extraction.
  • Convolutional Neural Networks (CNNs). For computer vision or image-based tasks, each expert can be a CNN specialized on certain image characteristics.
  • Recurrent Neural Networks (RNNs) / LSTMs. For sequential data such as time series or certain natural language tasks.
  • Transformers. For modern large-scale language modeling, each expert might be a feedforward sub-block in a Transformer layer.

The size and structure of each expert can vary significantly based on the complexity of the problem. In some advanced MoE setups, you might have a massive number of experts (ranging from tens to thousands), each of which is itself a deep network.

2.2. the gating network

Perhaps the core novelty behind mixture of experts is the gating network. While the experts themselves do the heavy lifting of mapping an input $x$ to an output $f_i(x)$, it's the gate that decides how these experts' outputs should be combined.

Conceptually, the gating network is a function $w(x)$ that outputs a vector of mixture coefficients. Each entry $w_i(x)$ in that vector is typically non-negative, and the entries might sum to 1 if the gate outputs a probability distribution. For a soft mixture, the final output might be:

$$f(x) = \sum_i w_i(x)\, f_i(x),$$

where $i$ indexes over all experts.

Hard vs. soft gating.

  • In hard gating, the gating network selects only the single best expert (or top-k experts) for each input, effectively ignoring all others. This can drastically reduce the computational overhead, since each sample only goes to one or a few sub-models.
  • In soft gating, the gating network assigns fractional weights to each expert, and the final output is a sum of all expert outputs weighted by these fractions.

Both approaches have their pros and cons. Hard gating is typically more computationally efficient, while soft gating can often yield smoother training signals because of the continuous mixture.
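
To make the contrast concrete, here is a minimal sketch in PyTorch (shapes and the choice of k are hypothetical) of how the same gating logits turn into mixture weights under the two schemes:

import torch
import torch.nn.functional as F

# Hypothetical gating logits for a batch of 4 inputs and 8 experts.
gate_logits = torch.randn(4, 8)

# Soft gating: every expert gets a non-zero weight; all 8 experts must be run.
soft_weights = F.softmax(gate_logits, dim=-1)            # [4, 8]

# Hard (top-k) gating: keep only the k largest logits per input, renormalize
# among them, and leave the remaining experts at zero, so only k experts run.
k = 2
topk_vals, topk_inds = torch.topk(gate_logits, k, dim=-1)
hard_weights = torch.zeros_like(gate_logits)
hard_weights.scatter_(-1, topk_inds, F.softmax(topk_vals, dim=-1))

With the soft weights the forward pass touches every expert, whereas with the hard variant only the two selected experts per input need to be evaluated.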

2.3. combining the experts

When the gating outputs are used to combine different expert outputs, the effect is somewhat akin to ensemble averaging, but done in a dynamic, input-dependent manner. Instead of a uniform or fixed weighting of ensemble members, you have an adaptive gating that can shift the model's focus as needed.

In some modern large-scale MoE designs — like the Switch Transformer — only a single expert is used at a time (top-1). Others might use a top-2 or top-k approach. In these sparse mixture settings, the gating step is crucial for deciding which experts get activated. This leads to a conditional computation architecture, where the maximum computational load is drastically reduced compared to a dense approach that calls on all experts concurrently.

3. core architecture

3.1. moe building blocks

While specific mixture of experts architectures can vary widely, they all share a few core building blocks:

  1. Input layer. The raw input $x$ is typically projected or encoded in some manner so that it's suitable for both the gating and the experts. This might be a simple embedding layer (in NLP tasks) or a series of convolutional layers in an image domain.
  2. Expert layers (parallel sub-networks). A set of $n$ parallel networks $f_1, \ldots, f_n$ that each produce a candidate output. If the mixture is hierarchical, you can have multiple layers of gating and experts in a tree-like structure.
  3. Gating layer. A gating function that takes $x$ (or some hidden representation of $x$) and produces gating coefficients. This gating output determines how the experts' results are aggregated.

Often, these building blocks are stacked in deeper architectures or interleaved with other components (e.g., self-attention layers in Transformers).

3.2. forward pass mechanics

In a forward pass, the following typically happens:

  1. Gating network evaluation. The gating network processes the input $x$ (or a derived representation from earlier layers) to produce a distribution over the experts, $w(x)$.
  2. Expert execution. Each expert, or a subset of the experts, produces its individual output $f_1(x), \ldots, f_n(x)$. In the case of sparse MoE, only the top-k experts might be invoked.
  3. Output aggregation. The final MoE output is $$f_{\text{MoE}}(x) = \sum_{i \in \mathcal{S}(x)} w_i(x)\, f_i(x),$$ where $\mathcal{S}(x)$ is the subset of experts chosen (the entire set for a soft mixture, or a top-k subset for a sparse mixture).

The gating distribution can be learned to reflect the diverse nature of the data. Over time, some experts might become specialized in certain regions of the input space, which helps direct future data points from that region to the correct sub-model.

3.3. sparse vs. dense experts

A crucial distinction in modern MoE research is whether the system is sparse or dense:

  • Dense MoE: Every expert is evaluated on every input, with outputs combined via soft weighting. This was popular in earlier MoE literature but scales poorly.
  • Sparse MoE: Only a small number of experts are selected (often 1 or 2) for each input, drastically cutting down on the required computation. This approach has fueled the explosion of extremely large MoE language models that can exceed a trillion parameters (in total) while keeping the cost per token relatively contained.

Naturally, sparse gating demands additional techniques to ensure the distribution of data is balanced among experts (otherwise, certain experts might never be selected). This leads to a variety of load-balancing strategies and auxiliary losses used in training these models.

4. training mixture of experts

4.1. objective function

The learning objective for a mixture of experts model typically includes:

  1. Primary loss. For supervised tasks, this might be the cross-entropy loss in classification or a mean squared error for regression. The total model output is the weighted output of the experts according to the gating distribution, and training proceeds to minimize (or maximize in the case of likelihood) the standard task loss.
  2. Regularization for gating. Many MoE systems add special terms to discourage degenerate solutions, such as gate imbalance (where only one or a few experts get picked). A typical strategy includes an auxiliary loss that encourages each expert to receive a relatively balanced workload over a batch or dataset.

Consider a typical supervised learning setting with training data $\{x_k, y_k\}_{k=1}^N$. If we define the gating function as $w(x) = (w_1(x), \ldots, w_n(x))$ and each expert $f_i$ has its own parameters, the mixture output for sample $x$ is:

$$f_{\text{MoE}}(x) = \sum_{i=1}^n w_i(x)\, f_i(x).$$

The main supervised loss $\mathcal{L}$ can be expressed as:

$$\mathcal{L}(\theta) = \frac{1}{N}\sum_{k=1}^N \mathrm{Loss}\bigl(f_{\text{MoE}}(x_k),\, y_k\bigr),$$

where $\theta$ collectively denotes all parameters (gating parameters plus the parameters of all experts).

When using an additional load-balancing term, you might add something like:

$$\mathcal{L}_{\text{total}}(\theta) = \mathcal{L}(\theta) + \lambda\, \mathcal{L}_{\text{balance}}(\theta),$$

where $\mathcal{L}_{\text{balance}}$ is designed to push the gating distribution to use all experts, and $\lambda$ is a hyperparameter controlling the trade-off.
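
One concrete choice for $\mathcal{L}_{\text{balance}}$, loosely following the auxiliary loss used in Switch Transformers, multiplies the fraction of inputs routed to each expert by the mean gate probability of that expert; the sketch below assumes top-1 routing and hypothetical tensor shapes:

import torch
import torch.nn.functional as F

def load_balance_loss(gate_logits: torch.Tensor) -> torch.Tensor:
    # gate_logits: [num_tokens, num_experts]
    num_experts = gate_logits.shape[-1]
    probs = F.softmax(gate_logits, dim=-1)
    # Fraction of tokens whose top-1 choice is expert i (not differentiable,
    # which matches how such losses are typically defined).
    top1 = probs.argmax(dim=-1)
    dispatch_frac = F.one_hot(top1, num_experts).float().mean(dim=0)
    # Mean router probability assigned to expert i (carries the gradient).
    mean_prob = probs.mean(dim=0)
    # Smallest when both vectors are uniform, i.e. experts are used evenly.
    return num_experts * torch.sum(dispatch_frac * mean_prob)

This term would then be scaled by $\lambda$ and added to the task loss, exactly as in the formula above.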

4.2. gradient flow and routing

Given that the gating is often a discrete selection in sparse MoE (hard gating), training these systems can be tricky. Hard gating yields a non-differentiable step (the top-$k$ selection), complicating backpropagation. Researchers have explored various strategies:

  • Soft gating with temperature annealing. Start with a softmax-based gate that includes all experts, then gradually sharpen the distribution to approach a discrete choice.
  • Straight-through estimators or REINFORCE-like methods. These estimate gradients through discrete decisions.
  • Continuous approximations. Approaches like Gumbel-Softmax or other reparameterization tricks attempt to make discrete sampling more differentiable.

When gating is fully soft (i.e., a standard softmax weighting all experts), then the training is simpler — just standard gradient-based training. However, that typically reintroduces large compute costs at inference.
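
To illustrate the continuous-approximation route mentioned above, PyTorch exposes a Gumbel-Softmax sampler that can act as a relaxed or straight-through one-hot gate; a minimal sketch with hypothetical shapes:

import torch
import torch.nn.functional as F

gate_logits = torch.randn(4, 8, requires_grad=True)       # [batch, num_experts]

# tau controls how sharp the relaxed distribution is; hard=True returns a
# one-hot sample in the forward pass while gradients flow through the soft
# relaxation (a straight-through estimator).
gates = F.gumbel_softmax(gate_logits, tau=1.0, hard=True)  # one-hot rows

# Combining with stacked expert outputs of shape [batch, num_experts, output_dim]:
expert_outputs = torch.randn(4, 8, 16)
output = torch.einsum("be,beo->bo", gates, expert_outputs)

In a real sparse layer you would, of course, avoid materializing all expert outputs and only run the selected ones.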

4.3. optimization challenges

  1. Expert collapse. In naive training, a single expert might end up receiving almost all the training samples that the gating network sees, while other experts are starved of data. This is detrimental because only that single expert is effectively learning, and the capacity gains from multiple experts are never realized.
  2. Balancing load. Even if gating attempts to distribute inputs, some experts might get flooded with more data than they can handle in a batch, leading to large variance in gradient updates.
  3. Partial gradient updates. In a sparse MoE, not all experts see data on every batch. An expert might see only the data assigned to it, so effectively, it can go many iterations with no gradient signal if it's rarely chosen.

4.4. implementation details

In practice, especially in large-scale language models, gating is typically performed at the token level (i.e., each token in a sequence is routed independently) or at the batch level (the entire input is assigned to a single or small group of experts). Token-level gating is more fine-grained but also more complex to parallelize.
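
A sketch of what token-level routing looks like in code (hypothetical shapes; moe_layer stands for any MoE module, such as the one defined later in this post): the sequence dimension is simply flattened so that every token becomes an independent routing unit, then reshaped back afterwards.

import torch

batch_size, seq_len, d_model = 8, 128, 512
hidden = torch.randn(batch_size, seq_len, d_model)

# Flatten so that each token is routed independently by the gate.
tokens = hidden.reshape(-1, d_model)            # [batch_size * seq_len, d_model]

# moe_out = moe_layer(tokens)                   # hypothetical MoE module
# moe_out = moe_out.reshape(batch_size, seq_len, d_model)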

Modern frameworks have begun offering specialized ops for mixture of experts, such as Google's Mesh TensorFlow, the Switch Transformer code in TensorFlow, and MoE modules in the PyTorch ecosystem. They manage the overhead of collecting tokens from across devices, distributing them to the relevant experts, and then returning the results.

5. variants and extensions of moe

5.1. hierarchical mixture of experts

A hierarchical MoE structure stacks multiple levels of gating networks in a tree. Each gating node splits the data distribution among children, eventually passing data down to leaf experts. The final mixture output might be:

$$f(x) = \sum_{i=1}^{N_1} w_i^{(1)}(x) \left( \sum_{j=1}^{N_2} w_{j|i}^{(2)}(x)\, f_{j|i}(x) \right),$$

where the upper gating weights $w_i^{(1)}(x)$ distribute data among sub-gating networks or sub-experts. Then each child gating function $w_{j|i}^{(2)}(x)$ further subdivides among its experts. This structure can be advantageous if your data distribution is naturally hierarchical — some tasks or subspaces might themselves be best subdivided further.
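
A dense (soft) version of this two-level mixture can be written directly from the formula; the module below is a minimal sketch with linear experts and hypothetical dimensions, not an optimized implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalMoE(nn.Module):
    # Two-level soft mixture: a top gate over groups, plus a child gate per group.
    def __init__(self, input_dim, output_dim, n_groups, n_experts_per_group):
        super().__init__()
        self.top_gate = nn.Linear(input_dim, n_groups)
        self.child_gates = nn.ModuleList(
            [nn.Linear(input_dim, n_experts_per_group) for _ in range(n_groups)]
        )
        self.experts = nn.ModuleList([
            nn.ModuleList([nn.Linear(input_dim, output_dim)
                           for _ in range(n_experts_per_group)])
            for _ in range(n_groups)
        ])

    def forward(self, x):
        w1 = F.softmax(self.top_gate(x), dim=-1)                 # [B, n_groups]
        out = 0.0
        for i, (gate_i, experts_i) in enumerate(zip(self.child_gates, self.experts)):
            w2 = F.softmax(gate_i(x), dim=-1)                    # [B, n_experts_per_group]
            group_out = sum(w2[:, j:j + 1] * experts_i[j](x)
                            for j in range(len(experts_i)))      # [B, output_dim]
            out = out + w1[:, i:i + 1] * group_out
        return out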

5.2. conditional computation & dynamic routing

One of the major draws of MoE is its ability to implement conditional computation: each input only activates a subset of the network. This concept can be extended beyond gating to other dynamic architectures:

  • Dynamic layers. Instead of having multiple parallel experts in a single layer, you might choose from different entire blocks of layers in a network.
  • Adaptive computation time. Some networks decide how many layers to apply based on the complexity of the input.

In all these approaches, the principle is the same: reduce the full model's compute cost by focusing only on the relevant components for a given input.

5.3. switch transformers

Switch Transformers, introduced by Fedus et al. (Journal of Machine Learning Research, 2022), represent a particular instantiation of MoE for large-scale language models. They restrict gating to select exactly one expert (top-1) out of many, drastically simplifying routing. They also introduced load-balancing losses that keep each expert from being over- or under-used.

Thanks to this approach, Switch Transformers could scale to extremely large parameter counts (in the hundreds of billions or even a trillion) while using relatively modest compute resources in practice. The gating function is typically a softmax over the linear projection of the token representation, with a top-1 selection.
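
A sketch of that routing rule with illustrative names and shapes (the selected expert's output is scaled by its router probability, which is what keeps the router trainable despite the discrete argmax):

import torch
import torch.nn.functional as F

def switch_route(token_repr: torch.Tensor, router_weight: torch.Tensor):
    # token_repr:    [num_tokens, d_model]
    # router_weight: [d_model, num_experts]
    logits = token_repr @ router_weight                   # [num_tokens, num_experts]
    probs = F.softmax(logits, dim=-1)
    expert_index = probs.argmax(dim=-1)                   # top-1 expert per token
    expert_prob = probs.gather(-1, expert_index.unsqueeze(-1)).squeeze(-1)
    # Each token t is dispatched to experts[expert_index[t]]; its output is
    # later multiplied by expert_prob[t].
    return expert_index, expert_prob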

5.4. moe in vision models

The concept of specialized experts has also found success in computer vision. For instance, Riquelme et al. (NeurIPS, 2021) introduced Vision MoE, applying the same gating-and-experts concept to feedforward blocks in Transformers built for image classification. Experts might be specialized in certain shapes, color distributions, or textures. Similarly, some designs incorporate gating that detects whether an image is related to a certain domain, picking the relevant sub-model.

6. practical considerations and best practices

6.1. hyperparameter tuning

When building and training an MoE system, the following hyperparameters become critical:

  1. Number of experts $n$: Larger $n$ means higher overall capacity but also more overhead in coordinating gating and parallelization.
  2. Top-k (for sparse gating) or gating distribution shape: Hard gating approaches rely on the top-k selection. The value of $k$ significantly affects both performance and efficiency.
  3. Learning rates: Often, the gating network might need its own separate learning rate or schedule so that it can quickly adapt the routing strategy.
  4. Load-balancing penalty weight $\lambda$: This can shift the gating distribution from heavily favoring a single expert to a more uniform distribution.

6.2. load balancing techniques

A major headache in MoE training is ensuring that experts are used in a balanced way. Common solutions include:

  • Auxiliary losses. Switch Transformers, for instance, introduced an additional term in the training objective to penalize imbalance.
  • Expert-bias-based balancing. Some research proposes letting each expert's bias automatically adjust if that expert is underused or overused.
  • Capacity constraints. Limit how many tokens each expert can process in a batch. If an expert is overloaded, some tokens are rerouted to the second-best choice.
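
To make the capacity idea concrete, here is a rough sketch of how a per-expert budget might be computed and how overflowing tokens could be flagged (helper names are hypothetical, and real systems reroute or drop overflow tokens in more sophisticated ways):

import torch

def expert_capacity(num_tokens: int, num_experts: int, capacity_factor: float = 1.25) -> int:
    # Maximum number of tokens a single expert may process in this batch.
    return int(capacity_factor * num_tokens / num_experts)

def overflow_mask(expert_index: torch.Tensor, num_experts: int, capacity: int) -> torch.Tensor:
    # expert_index: [num_tokens], the chosen expert for each token.
    # Returns True for tokens that exceed their expert's capacity.
    mask = torch.zeros_like(expert_index, dtype=torch.bool)
    for e in range(num_experts):
        positions = (expert_index == e).nonzero(as_tuple=True)[0]
        mask[positions[capacity:]] = True      # everything past the budget overflows
    return mask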

6.3. regularization strategies

Overfitting can be a problem, especially if an expert becomes overly specialized on a rare sub-domain. Strategies include:

  • Dropout or dropout variants (like drop-connect) that act within each expert.
  • Weight decay specifically targeted at each expert's parameters.
  • Expert-level data augmentation, especially in tasks like image processing, ensuring each sub-expert sees enough variety.

An additional approach is to occasionally force the gating network to try different experts for an input, helping them generalize rather than focusing too narrowly on a single region.
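
One lightweight way to encourage that kind of exploration, in the spirit of noisy top-k gating, is to perturb the gating logits during training so that near-tied experts occasionally swap places; a minimal sketch:

import torch

def noisy_gate_logits(gate_logits: torch.Tensor, noise_std: float = 1.0,
                      training: bool = True) -> torch.Tensor:
    # Add Gaussian noise to the logits during training only, so the subsequent
    # top-k selection sometimes visits experts it would otherwise ignore.
    if training:
        return gate_logits + noise_std * torch.randn_like(gate_logits)
    return gate_logits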

6.4. hardware and memory constraints

MoE architectures can lead to extremely large parameter counts, easily surpassing what is typical for single-GPU training. Thus, specialized hardware setups or distributed frameworks are commonly employed. Key considerations:

  • Parallelization: Experts can be split across multiple GPUs or TPUs, with gating performed in a "dispatcher" node that sends relevant data to each expert's device.

  • Parameter server vs. all-reduce: In large-scale systems (e.g., GShard, Mesh TensorFlow), parameters might be sharded across many devices.

  • Memory footprint: Even if only a fraction of experts are active per sample, all experts must be stored. This can demand advanced memory management or partitioning strategies.

7. use cases and applications

7.1. natural language processing

MoE has become especially prominent in large-scale NLP. Models like GShard, Switch Transformer, and GLaM use massive numbers of experts to expand capacity while keeping per-token compute feasible. These models typically show improved perplexities on language modeling tasks, and can be finetuned or instruction-tuned to accomplish a wide range of downstream tasks.

The gating network in an NLP context might be token-level gating, where each token in a batch can be routed to a different set of experts. This approach is helpful for code-switching texts, multilingual corpora, or specialized domain adaptation (e.g., legal or medical text).

7.2. computer vision

In computer vision, MoE-based Transformers (Vision MoE) partition the feedforward layers into experts. They have been shown to scale up image classification tasks efficiently. In object detection or instance segmentation, you could imagine an MoE approach where certain experts are specialized in small objects, while others handle large or unusual objects.

7.3. recommender systems

MoE is also well-suited to recommender systems, which often have a diverse user base with varying interests. Different experts may specialize in distinct user segments. A gating network might input user features and item features, deciding which sub-model is best for that particular user-item pair. This can lead to improved personalization and more efficient training.

7.4. multi-task learning

When dealing with multiple tasks that share some underlying representation but also require unique specializations, MoE can be extremely effective. The gating network might incorporate a "task ID" or some representation of the task into its decision, letting certain experts handle certain tasks. By sharing experts among tasks (and allowing some experts to remain specialized for one task), multi-task performance can often be boosted.

8. common pitfalls and solutions

8.1. expert collapse

Symptom: Only a few experts end up being used regularly; the rest remain idle.

Possible fixes:

  • Use stronger load-balancing penalties.
  • Introduce random or forced exploration in gating decisions (for example, for some fraction of the time, pick a different expert).
  • Ensure that the gating network has enough capacity or is trained with a separate schedule so it can properly learn to utilize different experts.

8.2. overfitting in specialized experts

Symptom: An expert becomes too narrowly adapted to a very small portion of the input space, leading to poor generalization.

Possible fixes:

  • Increase data augmentation or regularization within that expert.
  • Encourage some degree of overlap between experts so that each sub-domain is handled by more than one expert.
  • Merge or prune underutilized experts.

8.3. training instabilities

Symptom: Large swings in gating distributions over the course of training, or gradient explosion/vanishing in experts.

Possible fixes:

  • Carefully tune learning rates for gating vs. experts.
  • Apply gradient clipping to expert parameters or gating parameters.
  • Use a warm-up or curriculum strategy so that gating distribution changes more gradually.

8.4. practical debugging tips

  • Inspect gating distributions. Track how often each expert is selected per batch or per token; a quick way to tally this is sketched after this list. If some experts are never being used, re-check load-balancing hyperparameters.

  • Visualize activation patterns. For smaller or illustrative tasks, create heatmaps of which experts are chosen for which inputs.

  • Check resource usage across experts. If you're using a distributed setup, ensure that GPU usage is balanced and no single device is a bottleneck.
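
Following up on the first tip, a few lines of PyTorch are enough to tally how routing slots are distributed across experts; topk_inds below is the kind of index tensor produced by a top-k routing step, like the one in the code example later in this post:

import torch

def expert_usage(topk_inds: torch.Tensor, num_experts: int) -> torch.Tensor:
    # topk_inds: [batch_size, k] indices of the selected experts.
    # Returns the fraction of routing slots assigned to each expert.
    counts = torch.bincount(topk_inds.flatten(), minlength=num_experts).float()
    return counts / counts.sum()

Logging this vector every few hundred steps makes expert collapse visible almost immediately.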

9. future directions and research frontiers

9.1. combining moe with meta-learning

A fascinating direction is to make experts themselves adaptable. Instead of training a single fixed expert per region, one could incorporate meta-learning so that experts can quickly adapt to new tasks or data in their region of expertise. This would be especially potent in few-shot or continual learning scenarios, where new data distributions emerge over time.

9.2. multi-modal mixture of experts

With the rise of multi-modal AI — systems that handle text, images, video, or audio — an MoE approach can be used to route each modality to specialized experts. Even within a single modality, you might have experts focusing on different feature hierarchies. For instance, in a multi-modal architecture, you could have text experts, image experts, and audio experts, each being invoked only when the respective modality is present.

9.3. scaling to ultra-large models

We continue to see a race toward extremely large-scale models. By pushing the boundary of how many experts can be integrated, we can conceptually achieve "infinite capacity." The main hurdles remain practical: how to store these experts, how to effectively route data among them in a distributed environment, and how to keep training stable as model sizes explode.

Research also involves new routing strategies that might better handle large batch sizes and tricky distributions of data. We might see more advanced or learned scheduling algorithms that use optimization approaches or even reinforcement learning for routing.


Below, I expand on several additional topics and incorporate illustrative examples, code snippets, and LaTeX formulas to further deepen your understanding of mixture of experts. I also include references to the advanced theories behind the gating process and the mathematics of training. By doing so, you can clearly see how these ideas coalesce into state-of-the-art architectures like Switch Transformer, GLaM, and other large-scale MoE systems.

extended discussion: deeper theoretical underpinnings

bayesian viewpoint

Mixture of experts can be viewed under a Bayesian lens: the gating function's output $w_i(x)$ can be interpreted as the prior probability that expert $i$ is correct for input $x$. Meanwhile, the expert's output can be considered its predictive distribution for the label or target. One can combine these to form a posterior for each expert's correctness.

In the simplest classical case (like the adaptive mixtures of local experts by Jacobs and Jordan), each expert is a Gaussian distribution parameterized by $\mu_i$ (and possibly $\Sigma_i$). The gating network yields $w_i(x)$, the prior. After seeing the actual label $y$, you can do a Bayesian update.
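
In the standard EM view of such a model, that update takes the form of a posterior "responsibility" for each expert (writing $h_i$ for the responsibility of expert $i$ and $p_i(y|x)$ for its predictive density):

$$h_i(x, y) = \frac{w_i(x)\, p_i(y|x)}{\sum_{j=1}^{n} w_j(x)\, p_j(y|x)}.$$

The E-step computes these responsibilities, and the M-step then updates each expert primarily on the samples it is currently responsible for.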

This interpretation clarifies how experts become specialized — once an expert invests in a certain region of the input-output space, the gating network shifts more probability to it whenever the input is recognized as belonging to that region.

additional latex example

Suppose you have a classification setting with $C$ classes, and each expert $i$ produces a probability distribution $p_i(y|x)$. Then the MoE output distribution is:

$$p(y|x) = \sum_{i=1}^n w_i(x)\, p_i(y|x).$$

The gating network parameters $\theta_0$, and each expert's parameters $\theta_i$, are jointly learned by minimizing the negative log-likelihood:

$$-\log p(y|x) = -\log\Bigl(\sum_{i=1}^n w_i(x)\, p_i(y|x)\Bigr).$$

One challenge here is that the sum over experts is inside a log, leading to potential gradient difficulties when $w_i(x)$ is near zero for many experts. This is part of what spurred interest in sparse gating, where typically only the top-k terms in the sum are used.
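
In log space the objective can be computed stably with a log-sum-exp, which also makes the "sum inside the log" explicit; a sketch assuming each expert returns per-class log-probabilities:

import torch
import torch.nn.functional as F

def mixture_nll(gate_logits: torch.Tensor, expert_log_probs: torch.Tensor,
                targets: torch.Tensor) -> torch.Tensor:
    # gate_logits:      [batch, num_experts]
    # expert_log_probs: [batch, num_experts, num_classes], i.e. log p_i(y|x)
    # targets:          [batch], integer class labels
    log_w = F.log_softmax(gate_logits, dim=-1)                    # log w_i(x)
    # Select log p_i(y_k | x_k) for the true class of each sample.
    idx = targets.view(-1, 1, 1).expand(-1, expert_log_probs.size(1), 1)
    log_p_y = expert_log_probs.gather(2, idx).squeeze(-1)         # [batch, num_experts]
    # Negative log of sum_i exp(log w_i + log p_i(y|x)), averaged over the batch.
    return -(torch.logsumexp(log_w + log_p_y, dim=-1)).mean()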

extended discussion: advanced routing strategies

example routing code (python)

Below is a simplified demonstration of how one might implement a top-k gating mechanism in Python (pseudo-PyTorch style) for a single MoE layer with hard gating:


import torch
import torch.nn as nn
import torch.nn.functional as F

class GatingNetwork(nn.Module):
    def __init__(self, input_dim, num_experts):
        super().__init__()
        self.linear = nn.Linear(input_dim, num_experts)
    
    def forward(self, x):
        # x: [batch_size, input_dim]
        # output: [batch_size, num_experts]
        logits = self.linear(x)
        return logits

class Expert(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        # x: [batch_size, input_dim]
        h = F.relu(self.fc1(x))
        out = self.fc2(h)
        return out

class MixtureOfExperts(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, num_experts, k=1):
        super().__init__()
        self.num_experts = num_experts
        self.k = k
        self.gate = GatingNetwork(input_dim, num_experts)
        self.experts = nn.ModuleList([
            Expert(input_dim, hidden_dim, output_dim) for _ in range(num_experts)
        ])
    
    def forward(self, x):
        # x: [batch_size, input_dim]
        gate_logits = self.gate(x)  # [batch_size, num_experts]

        # Let's do a top-k selection for each instance in the batch
        # gate_logits is shape [B, n], we want top-k for each row
        topk_vals, topk_inds = torch.topk(gate_logits, self.k, dim=1)
        # We'll apply a softmax among the chosen k
        # Then we'll compute the output from each chosen expert
        batch_size = x.shape[0]
        out = torch.zeros(batch_size, self.experts[0].fc2.out_features, device=x.device)

        for b in range(batch_size):
            # Get indices of top-k experts
            inds = topk_inds[b]   # shape [k]
            vals = topk_vals[b]   # shape [k]
            # Softmax among the top-k
            sm_vals = F.softmax(vals, dim=0)  # shape [k]
            
            # Compute each expert's output and weigh it
            for i, exp_idx in enumerate(inds):
                e_out = self.experts[exp_idx](x[b].unsqueeze(0)) # shape [1, output_dim]
                out[b] += sm_vals[i] * e_out.squeeze(0)
        
        return out

In this snippet:

  • GatingNetwork is a linear module that produces $n$ logits, one for each expert.
  • torch.topk is used to pick the top-$k$ experts for each sample in the batch.
  • A local softmax is then performed on those top-k logits so that they sum to 1.
  • Only those experts are run, and their outputs are combined proportionally to the gating probabilities.

Note that this code is a simplified illustration. Real-world implementations handle larger batch sizes in parallel, manage distributed training, and incorporate additional balancing losses. They might also store the input tokens for each selected expert in a single contiguous buffer to speed up matrix multiplications.
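
For completeness, a quick usage sketch of the module above, with made-up dimensions:

# Continuing from the snippet above (torch is already imported).
moe = MixtureOfExperts(input_dim=32, hidden_dim=64, output_dim=10,
                       num_experts=4, k=2)
x = torch.randn(8, 32)       # a batch of 8 inputs
y = moe(x)                   # [8, 10], a weighted mix of 2 experts per input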

extended discussion: specialized applications

speech processing

In speech recognition, different speakers, accents, or environmental conditions can be drastically distinct. An MoE architecture might create experts that each focuses on male voices vs. female voices, or noisy vs. clean audio. Over time, the gating network learns that certain acoustic features map to a specialized sub-model that better handles that type of input.

anomaly detection

In anomaly or outlier detection, MoE can be used so that one or more experts specialize in normal operating conditions, while another set of experts focuses on capturing rarer phenomena or error states. The gating network learns to detect that an input is unusual and route it to the experts that handle anomalies well.

multi-lingual tasks

As mentioned, multi-lingual text modeling is a prime case for MoE, because each language can represent a fairly distinct distribution. A gating network might pick language-specific experts or even experts that specialize in the morphological or syntactic structure of certain language families.

extended discussion: hardware scaling and distributed setups

gshard and beyond

Google's GShard system exemplified how to train extremely large MoE models by sharding the parameters across a large number of devices. Each expert's parameters (or a subset of them) might reside on a separate worker. During a forward pass, tokens from the global batch are assigned to whichever experts are chosen by the gating function. Then, an efficient collective communication strategy is used to gather the token representations to the relevant device(s), do the forward pass, and gather results back.

A major engineering challenge is ensuring that these communications don't become a bottleneck. Techniques such as capacity factor constraints, expert parallelization, and carefully orchestrated micro-batches can reduce overhead.

memory vs. compute trade-offs

MoE models can push the total parameter count into the trillions, and that in itself can hamper training if not carefully managed. Still, if each token only sees a small fraction of those parameters, the total training compute can remain manageable. This capacity vs. compute dynamic is a large reason behind MoE's popularity among large language model developers.

extended discussion: real-world debugging scenarios

  1. Under-utilized experts in early training: Sometimes, random initialization means the gating network gives higher gating logits to a particular expert. This can snowball: that expert's parameters improve faster, encouraging the gate to use it even more. If you notice this, consider an approach that forces random gating for some warm-up period.

  2. Overlapping experts: Two experts might inadvertently learn very similar functions. This can reduce the overall diversity of the system. Checking correlation or similarity metrics between experts' parameters can reveal if they are redundant.

  3. Expert drift: As training progresses, an expert that was specialized in a certain domain might shift to a different one, potentially leaving the first domain uncovered. Periodic snapshots of gating distributions can reveal domain "drift."

final notes on continuing research

Mixture of experts remains an active area of research, particularly in the context of large-scale systems. Open questions and directions include:

  • Dynamic creation and removal of experts: Could the model automatically spawn new experts if the data distribution is too broad, or retire experts if they become obsolete?
  • Hierarchical gating: More complex gating topologies might yield better results for structured data or multi-task scenarios.
  • Exploration vs. exploitation: The gating network must exploit known assignments (since some experts are definitely good in specific domains) while also exploring assignments that might reveal hidden sub-problems. This dynamic is reminiscent of multi-armed bandit problems in reinforcement learning.
  • Multi-modal expansions: As multi-modal tasks become standard, we expect new MoE designs that elegantly unify textual, visual, auditory, and even sensor-based inputs under a single gating framework.

And beyond these technical concerns, there is growing interest in the interpretability of large MoE models. Because gating networks essentially "decide" which sub-model handles each input, some interpretability can be gleaned by seeing the gating function's decisions. This is in contrast to monolithic models where it can be harder to discern which part of the network is responsible for which type of input behavior.

additional image placeholders

[Image placeholder. Alt: "MoE architecture diagram". Caption: "A schematic showing an input x fed into a gating network, which then routes to multiple experts f1, f2, ..., fn, and finally aggregates their results."]

[Image placeholder. Alt: "Sparse vs. dense gating illustration". Caption: "Side-by-side depiction of a dense gating approach (where all experts are used) versus a sparse gating approach (where only top-k experts are used for each input)."]

conclusion

Mixture of experts architectures stand at the forefront of addressing two critical challenges in modern machine learning: the necessity for enormous capacity to handle diverse data, and the reality of constrained computational resources. By embracing a divide-and-conquer philosophy, MoE models enable specialized sub-networks to excel at particular subtasks or input domains, all orchestrated by a gating mechanism that learns optimal routing.

Far from being a mere academic curiosity of the 1990s, mixture of experts has resurfaced as a leading design principle for building extremely large models, notably in natural language processing and, increasingly, in other fields like computer vision and recommender systems. The potential for dynamic routing, conditional computation, and specialized adaptation makes MoE a tantalizing direction for future research — one that might well define the next generation of cutting-edge AI systems.

In your own work, exploring mixture of experts can allow you to scale your models dramatically, handle heterogeneous data distributions with aplomb, and ultimately build deep learning solutions that more closely emulate real-world specialization — mirroring how humans organize expertise to tackle an incredibly wide range of problems.
