with the exception of “mixture-of-experts models” that I think we should disregard for these purposes, for reasons I won’t go into here
This is taken from a footnote. Clicking on the link and reading the abstract, it immediately jumped out as something we should potentially be quite concerned about (i.e. the potential to scale models by ~1000x using the same compute!), so I’m curious about the reasons for disregarding it that you didn’t go into in the post. Can you go into them here?
Using the “Cited by” feature on Google Scholar, I’ve found some more recent papers, which give the impression that progress is being made with mixture-of-experts models (which could potentially dramatically speed up timelines?). Also, naively, is this not kind of how the brain works? Different subsets of neurons (brain areas) are used for different tasks (vision, hearing, memory, etc.). Emulating this with ML models seems like it would be a huge step forward for AI.
Yes, the brain is sparse and semi-modularized, but it’d be hard to really call it more ‘brain-like’ than dense models. Brains have all sorts of very long range connections in a small-world topology, where most of the connections may be local but there’s still connections to distant parts, and those are important; distant brain regions can also communicate and be swapped in and out as the brain recurs and ponders. The current breed of MoEs along the lines of Switch Transformer don’t do any of that. They do a single pass, and each module is completely local and firewalled from the others. This is what makes them so ‘efficient’: they are so separate they can be run and optimized easily in parallel with no communication and they handle only limited parts of the problem so they are still early in the scaling curve.
To continue Holden’s analogy, it’s not so much like gluing 100 mouse brains together (or in my expression, ‘gluing a bunch of chihuahuas back to back and expecting them to hunt like a wolf’), it’s like having one mouse brain as a harried overworked MBA manager who must send an email off to one or two of his 99 mouse employees, each of whom then must take care of the job entirely on their own that instant (and are not allowed to communicate or ask for clarification or delegate to any of the other mice).
The more you add recurrence, flexible composition of experts, or long-range connections, the more you give up what made them cheap in the first place… I continue to be skeptical that MoEs as currently pursued are anything but a distracting, penny-wise-pound-foolish diversion, settling for trying to ape GPT-3 at merely fractional savings. Sure, approaches like ERNIE 3.0 Titan look horrifically expensive, but at least they look like they’re pushing into new territory.
My impression from following r/mlscaling for a while and reading a bunch of comments by Gwern, and also various papers… is that MoE models aren’t that good. But I don’t really have a good understanding of this, so I could be wrong; I guess I’m just deferring to others like Gwern. I also figure that if MoE models were anywhere near as good as the ~1000x figure suggests, we would have seen something dramatically better than GPT-3 already, and we haven’t.
My understanding is that “mixture of experts” essentially comes down to training multiple distinct models, and having some “meta” procedure for assigning problems (or pieces of problems) to them.
Since training expense grows with something like the square of model size, it’s much more expensive to train one big model than N smaller models that are each 1/N as big (plus a procedure for choosing between the N smaller models).
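The toy accounting behind that claim can be spelled out. This is just arithmetic under Holden's stated assumption that training cost grows with "something like the square" of model size; the numbers are illustrative, not real parameter counts:

```python
# Toy cost comparison under the assumption cost ~ size**2.
# S is the size of one big model; N is the number of small models,
# each of size S/N. Both figures are made up for illustration.
S, N = 100, 100

one_big = S ** 2                # cost of one model of size S: 10,000
n_small = N * (S / N) ** 2      # cost of N models of size S/N: 100 * 1 = 100

ratio = one_big / n_small       # the big model is ~100x more expensive
print(one_big, n_small, ratio)
```

Under this rough accounting, splitting one size-S model into N size-S/N experts cuts training cost by a factor of N, which is where the dramatic claimed savings come from.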
A human brain is about 100x the “size” of a mouse brain. So for a metaphor, you can think of “mixture of experts” as though it’s trying to use 100 “mouse brains” (all working together under one procedure, and referred to as a single model) in place of one “human brain.” This should be a lot cheaper (see previous paragraph), and there are intuitive reasons we’d expect it to be less powerful as well (imagine trying to assign intellectual tasks to 100 mice in a way that mimics what a human can do).
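The "meta procedure for assigning problems to experts" described above can be sketched minimally. This is a hedged illustration of top-1 ("switch"-style) routing in the spirit of the Switch Transformer mentioned earlier, not anyone's actual implementation; all dimensions and weights are made up, and each "expert" is just a single random matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, n_tokens = 8, 4, 5

# Hypothetical learned weights: one router, plus one tiny matrix per expert.
router_w = rng.normal(size=(d_model, n_experts))
expert_w = rng.normal(size=(n_experts, d_model, d_model))

tokens = rng.normal(size=(n_tokens, d_model))

# The router scores every expert for every token...
logits = tokens @ router_w                                   # (n_tokens, n_experts)
gate = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
# ...and each token is sent to exactly one expert (top-1 routing).
choice = logits.argmax(axis=1)                               # (n_tokens,)

out = np.empty_like(tokens)
for i, t in enumerate(tokens):
    e = choice[i]
    # The chosen expert handles the token entirely on its own;
    # experts never see each other's tokens or communicate.
    out[i] = gate[i, e] * (t @ expert_w[e])
```

Note how the structure matches the mouse-manager analogy: the router (the "manager") makes one dispatch decision per token, and the selected expert does all the work in isolation, which is exactly what makes the experts cheap to run in parallel.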
This version of the mice analogy was better than mine, thanks!
Thanks for the detailed reply, that makes sense. What do you make of Google’s Pathways?