This is the first article in a series exploring MoE. We will build up MoE from scratch, all the way to the point where it was rumored to be part of the GPT-4 architecture.

Some history behind MoE

The Godfather of AI, Geoffrey E. Hinton (along with three others), worked on a 1991 paper called ‘Adaptive Mixtures of Local Experts’ (https://www.cs.toronto.edu/~hinton/absps/jjnh91.pdf). The whole idea of the paper was to build a system composed of separate networks, each handling a different subset of the training data, rather than putting the complete load on a single feed-forward neural network (FFN).

Since then, MoE has remained a point of focus in machine learning research. One of the important works is Shazeer et al. (https://arxiv.org/abs/1701.06538), in which the authors used MoE layers inside an LSTM-based model to scale it to 137B parameters while still keeping inference fast.

Thereafter, several models have been developed using this paradigm - Mixtral-8×7B, DeepSeek-V2, Grok-1 - and, most famously, it has been suspected to be part of GPT-4.

The figure below shows an overview of different MoE models over the years -

[Figure: timeline of MoE models over the years]

So why do we even need MoE?

We have already seen the advancements of LLMs in 2023 and 2024, and everyone is now talking about scaling laws for LLMs. Here, scaling includes everything - the model’s size, the amount of training data, and the amount of computation.

So, MoE is very useful in this case: it lets a model grow to a much larger total parameter count while activating only a fraction of those parameters for any given input, so it can match the quality of a dense model of comparable size at a much lower computational cost.

We’ll be exploring this in detail in the next article!

What is MoE?

The MoE framework is pretty simple - the whole model is divided into several parts known as experts, and these different experts specialize in different tasks or focus on different types of data. The experts are activated by a gating function. This paradigm lets the model activate only certain experts depending on the input, which keeps the computational cost in check. MoE can also be viewed as an ensemble learning technique, since it combines multiple specialized models, the experts.

Architecture components:

  1. Experts: Independent models (each basically a small neural network) that end up handling specific portions of the data. Which portion each expert handles is decided by the gating function.
  2. Gating Network: A model (such as an FFN) that takes an input and outputs a probability distribution over the experts. Based on this distribution, it decides which expert(s) to use for that particular data point (see the code sketch after this list).
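To make these two components concrete, here is a minimal sketch of a sparse MoE layer in PyTorch. Everything here is illustrative - the layer sizes, the number of experts, and the top-2 routing are assumptions for the example, not the design of any specific model from the papers above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoE(nn.Module):
    """A minimal sparse MoE layer: a gating network picks the top-k experts per input."""
    def __init__(self, d_model=64, d_hidden=128, num_experts=4, top_k=2):
        super().__init__()
        # Experts: independent feed-forward networks
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])
        # Gating network: maps each input to a probability distribution over the experts
        self.gate = nn.Linear(d_model, num_experts)
        self.top_k = top_k

    def forward(self, x):                                       # x: (batch, d_model)
        gate_probs = F.softmax(self.gate(x), dim=-1)            # (batch, num_experts)
        topk_probs, topk_idx = gate_probs.topk(self.top_k, dim=-1)
        # Renormalize so the selected experts' weights sum to 1 per input
        topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)

        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = (topk_idx == i)                               # which top-k slots chose expert i
            if mask.any():
                rows = mask.any(dim=-1)                          # batch rows routed to expert i
                weight = (topk_probs * mask).sum(dim=-1, keepdim=True)
                # Only the routed rows are pushed through this expert
                out[rows] += weight[rows] * expert(x[rows])
        return out

# Usage: route a batch of 8 input vectors through the layer
moe = SimpleMoE()
tokens = torch.randn(8, 64)
print(moe(tokens).shape)    # torch.Size([8, 64])
```

The key point of the sketch is that each input only pays the cost of the experts it is routed to (here, 2 out of 4), which is exactly how MoE keeps compute in check while the total parameter count grows with the number of experts.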