My impression from following r/mlscaling for a while, reading a bunch of comments by Gwern, and skimming various papers is that MoE models aren’t that good. But I don’t have a deep understanding of this myself, so I could be wrong; to a large extent I’m deferring to others like Gwern. I also figure that if MoE models really delivered anything like a 1000x improvement, we would already have seen something dramatically better than GPT-3, and we haven’t.
My understanding is that “mixture of experts” essentially comes down to training multiple distinct models, and having some “meta” procedure for assigning problems (or pieces of problems) to them.
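To make that picture concrete, here is a toy sketch of what an “expert plus gating procedure” setup might look like. This is just an illustration of the general idea under my own simplifications (the sizes, the top-1 routing rule, and all the names are made up for the example), not how any real MoE model is implemented:

```python
# Toy mixture-of-experts layer: several small MLP "experts" plus a learned
# gating function that scores the experts for each input and routes the
# input to its top-scoring expert. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)

D_IN, D_HID, N_EXPERTS = 16, 32, 4  # toy sizes chosen for illustration

# One pair of weight matrices per expert (each expert is a tiny 2-layer MLP).
experts = [
    (rng.standard_normal((D_IN, D_HID)) * 0.1,
     rng.standard_normal((D_HID, D_IN)) * 0.1)
    for _ in range(N_EXPERTS)
]
# Gating network: a single linear layer that scores the experts per input.
gate_w = rng.standard_normal((D_IN, N_EXPERTS)) * 0.1

def moe_forward(x):
    """Route each row of x to its top-1 expert and return the outputs."""
    scores = x @ gate_w                        # (batch, n_experts)
    probs = np.exp(scores - scores.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)  # softmax over experts
    chosen = probs.argmax(axis=1)              # top-1 routing decision
    out = np.zeros_like(x)
    for i, e in enumerate(chosen):
        w1, w2 = experts[e]
        h = np.maximum(x[i] @ w1, 0.0)         # ReLU hidden layer
        out[i] = (h @ w2) * probs[i, e]        # scale by the gate weight
    return out, chosen

batch = rng.standard_normal((8, D_IN))
outputs, routing = moe_forward(batch)
print("expert chosen per input:", routing)
```

The key point for the argument below is that only one (small) expert does work for any given input, even though the total parameter count across all experts can be huge.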
Since training expense grows with something like the square of model size, it’s much more expensive to train one big model than N smaller models that are each 1/N as big (plus a procedure for choosing between the N smaller models).
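Spelled out as back-of-the-envelope arithmetic, taking that “roughly quadratic in model size” cost assumption as given (it’s only a heuristic, and the parameter count and expert count here are made-up numbers):

```python
# One model with P parameters vs. N models with P/N parameters each,
# under the assumption that training cost scales like (parameters)^2.
P = 100e9      # hypothetical total parameter count
N = 100        # hypothetical number of smaller models / experts

cost_one_big = P ** 2               # proportional to P^2
cost_n_small = N * (P / N) ** 2     # N * (P/N)^2 = P^2 / N

print(cost_one_big / cost_n_small)  # => 100.0, i.e. roughly N times cheaper
```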
A human brain is about 100x the “size” of a mouse brain. So for a metaphor, you can think of “mixture of experts” as though it’s trying to use 100 “mouse brains” (all working together under one procedure, and referred to as a single model) in place of one “human brain.” This should be a lot cheaper (see previous paragraph), and there are intuitive reasons we’d expect it to be less powerful as well (imagine trying to assign intellectual tasks to 100 mice in a way that mimics what a human can do).