“looking at nonhuman brains is useless because you could have a perfect understanding of a chimpanzee brain but still completely fail to predict human behavior (after a ‘sharp left turn’).”
Sounds too strong to me. If Nate or Eliezer thought that it would be totally useless to have a perfect understanding of how GPT-3, AlphaZero, and Minerva do their reasoning, then I expect that they’d just say that.
My Nate-model instead says things like:
Current transparency work mostly isn't trying to gain deep mastery of how GPT-3 etc. do their reasoning; and to the extent that it is trying, it isn't making meaningful progress.
(‘Deep mastery of how this system does its reasoning’ is the sort of thing that would let us roughly understand what thoughts a chimpanzee is internally thinking at a given time, verify that it’s pursuing the right kinds of goals and thinking about all (and only) the right kinds of topics, etc.)
A lot of other alignment research isn’t even trying to understand chimpanzee brains, or future human brains, or generalizations that might hold for both chimps and humans; it’s just assuming there’s no important future chimp-to-human transition it has to worry about.
Once we build the equivalent of ‘humans’, we won’t have much time to align them before the tech proliferates and someone accidentally destroys the world. So even if the ‘understand human cognition’ problem turns out to be easier than the ‘understand chimpanzee cognition’ problem in a vacuum, the fact that it’s a new problem and we have a lot less time to solve it makes it a lot harder in practice.