Yes, the brain is sparse and semi-modularized, but it’d be hard to really call MoEs more ‘brain-like’ than dense models. Brains have all sorts of very long-range connections in a small-world topology: most of the connections may be local, but there are still connections to distant parts, and those are important; distant brain regions can also communicate and be swapped in and out as the brain recurs and ponders. The current breed of MoEs along the lines of the Switch Transformer doesn’t do any of that. They do a single pass, and each module is completely local and firewalled from the others. This is what makes them so ‘efficient’: they are so separate that they can be run and optimized easily in parallel with no communication, and each handles only a limited part of the problem, so they are still early in the scaling curve.
To continue Holden’s analogy, it’s not so much like gluing 100 mouse brains together (or in my expression, ‘gluing a bunch of chihuahuas back to back and expecting them to hunt like a wolf’) as like having one mouse brain as a harried, overworked MBA manager who must send an email off to one or two of his 99 mouse employees, each of whom must then take care of the job entirely on their own, that instant (and is not allowed to communicate, ask for clarification, or delegate to any of the other mice).
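To make the ‘firewalled’ point concrete, here is a minimal toy sketch of Switch-style top-1 routing in plain NumPy. The weight names and shapes are made up purely for illustration; a real implementation adds expert-capacity limits, a load-balancing loss, and device sharding. Each token is dispatched to exactly one expert, and the experts never exchange activations:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, n_tokens = 16, 4, 8

# Toy stand-ins for trained weights (illustrative shapes only):
# a router matrix and one small feed-forward "expert" per slot.
router_w = rng.normal(size=(d_model, n_experts))
expert_w = rng.normal(size=(n_experts, d_model, d_model))

def switch_layer(x):
    """Top-1 routing: each token goes to exactly one expert.

    Experts never see each other's tokens and never exchange
    activations -- the only 'communication' is the router's choice.
    """
    logits = x @ router_w                      # (n_tokens, n_experts)
    choice = logits.argmax(axis=-1)            # one expert per token
    gate = np.exp(logits - logits.max(-1, keepdims=True))
    gate = gate / gate.sum(-1, keepdims=True)  # softmax gate values

    out = np.zeros_like(x)
    for e in range(n_experts):
        mask = choice == e
        if mask.any():
            # Each expert processes only its own tokens, independently;
            # in practice these branches run on separate devices in parallel.
            out[mask] = (x[mask] @ expert_w[e]) * gate[mask, e:e + 1]
    return out

tokens = rng.normal(size=(n_tokens, d_model))
print(switch_layer(tokens).shape)  # (8, 16)
```

The loop over experts is the whole point: each iteration touches only its own subset of tokens, which is exactly why the experts can sit on separate devices and never need to talk to one another.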
The more you add recurrence or flexible composition of experts or long-range connections, the more you give up what made them cheap in the first place… I continue to be skeptical that MoEs as currently pursued are anything but a distracting, penny-wise-pound-foolish sort of diversion, settling for trying to ape GPT-3 at mere fractional savings. Sure, approaches like ERNIE 3.0 Titan look horrifically expensive, but at least they look like they’re pushing into new territory.
This version of the mice analogy was better than mine, thanks!
Thanks for the detailed reply, that makes sense. What do you make of Google’s Pathways?