My understanding is that “mixture of experts” essentially comes down to training multiple distinct models, and having some “meta” procedure for assigning problems (or pieces of problems) to them.
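To make that "experts plus a routing procedure" picture concrete, here is a toy sketch (not how any particular production system is implemented; the dimensions, the random weights, and the top-1 routing rule are all made-up illustrations): a small "router" scores the input, and only the single expert it picks does any work.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, purely for illustration.
d_model, n_experts = 16, 4

# Each "expert" is its own small model; here, just one random linear layer each.
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

# The "meta" procedure is a gating/router network; here, a random linear map
# from the input to one score per expert.
router = rng.normal(size=(d_model, n_experts))

def moe_layer(x):
    """Route the input to the single highest-scoring expert (top-1 routing)."""
    scores = x @ router              # one score per expert
    chosen = int(np.argmax(scores))  # the router "assigns the problem" to one expert
    return x @ experts[chosen]       # only that expert's parameters are used

x = rng.normal(size=d_model)
print(moe_layer(x).shape)  # (16,) -- computed by just one of the 4 experts
```

In real systems the routing is typically learned jointly with the experts and applied per token and per layer, but the basic shape is the same: many small models, one procedure deciding who handles what.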
Since training expense grows with something like the square of model size, it’s much more expensive to train one big model than N smaller models that are each 1/N as big (plus a procedure for choosing between the N smaller models).
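Taking that square-law claim at face value (it's a simplification; real scaling behavior is more complicated), the arithmetic works out to roughly an N-fold savings:

```python
# Toy comparison under the assumed "cost ~ size squared" rule.
S = 100  # size of the one big model, in arbitrary units
N = 100  # number of smaller models, each 1/N as big

big_cost = S ** 2               # one big model: S^2
small_cost = N * (S / N) ** 2   # N models of size S/N: N * (S/N)^2 = S^2 / N

print(big_cost, small_cost)  # 10000 vs. 100 -- about N times cheaper
```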
A human brain is about 100x the “size” of a mouse brain. So for a metaphor, you can think of “mixture of experts” as though it’s trying to use 100 “mouse brains” (all working together under one procedure, and referred to as a single model) in place of one “human brain.” This should be a lot cheaper (see previous paragraph), and there are intuitive reasons we’d expect it to be less powerful as well (imagine trying to assign intellectual tasks to 100 mice in a way that mimics what a human can do).