
MoE and Dense Transformers Learn the Same Thing

Created: 2024-04-28
Wordcount: 0.2k

Here's an interesting question: Given that they are trained to the same overall loss, do MoE transformers learn different things from dense transformers?

My extremely preliminary guess was: probably they learn different things?

My experimental results are: nope, looks like they learn the same things, at least when judged with extremely crude tools.

To test this, I trained two dense transformers and two MoE transformers to equivalent loss. I trained them on exactly the same data in the same order, so that within each pair the random seed was the only difference between the models.

If you then examine the per-token loss for the two dense transformers, you naturally find a high R^2 between them: low / high loss on a token from one transformer predicts low / high loss on that token from the other transformer. The R^2 was 0.935.

The R^2 between an MoE transformer and a dense transformer was a little lower: 0.925.

But the R^2 between the two MoE transformers was lower still: 0.924. So a dense transformer seemed (extremely marginally) more predictive of an MoE transformer than the MoE transformers were of each other.
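
As a rough illustration of the comparison, here is a minimal sketch of how per-token losses can be collected from two models and an R^2 computed between them. The model and data names are hypothetical (not the actual experimental code), and R^2 is taken here as the squared Pearson correlation between the two loss vectors.

```python
import torch
import torch.nn.functional as F

def per_token_losses(model, batches, device="cuda"):
    """Return a 1-D tensor with the cross-entropy loss for every target token."""
    model.eval()
    losses = []
    with torch.no_grad():
        for input_ids in batches:               # (batch, seq_len) token ids
            input_ids = input_ids.to(device)
            logits = model(input_ids)           # assumes model returns (batch, seq_len, vocab) logits
            # Predict token t+1 from positions up to t.
            loss = F.cross_entropy(
                logits[:, :-1].reshape(-1, logits.size(-1)),
                input_ids[:, 1:].reshape(-1),
                reduction="none",
            )
            losses.append(loss.cpu())
    return torch.cat(losses)

def r_squared(x, y):
    """Squared Pearson correlation between two per-token loss vectors."""
    x, y = x - x.mean(), y - y.mean()
    r = (x * y).sum() / (x.norm() * y.norm())
    return (r ** 2).item()

# Hypothetical usage with the four trained models and a shared eval set:
# losses = {name: per_token_losses(m, eval_batches) for name, m in models.items()}
# print(r_squared(losses["dense_a"], losses["dense_b"]))  # dense vs dense
# print(r_squared(losses["dense_a"], losses["moe_a"]))    # dense vs MoE
# print(r_squared(losses["moe_a"],   losses["moe_b"]))    # MoE vs MoE
```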

This seems like reasonable evidence that MoE transformers are not systematically learning different things than dense transformers, although ideally a future analysis would test more fully-trained models -- the ones here were small and undertrained.