I feel like you’re trying to round these three things into a “yay versus boo” axis, and then come down on the side of “boo”. I think we can try to do better than that.
One can make certain general claims about learning algorithms that are true and for which evolution provides as good an example as any. One can also make other claims that are true for evolution and false for other learning algorithms. And then we can argue about which category future AGI will be in. I think we should be open to that kind of dialog, and it involves talking about evolution.
Likewise, I think “inner misalignment versus outer misalignment” is a helpful and valid way to classify certain failure modes of certain AI algorithms.
For the third one, there’s an argument like:
“Maybe the AI will really want something-or-other to happen in the future, and try to make it happen, including by long-term planning (y’know, the way some humans really want to break out of prison, or the way Elon Musk really wants to go to Mars). Maybe the AIs have other desires and do other things too, but that’s not too relevant to what I’m saying. Next, there are a lot of reasons to think that ‘AIs that really want something-or-other to happen in the future’ will show up sooner or later, e.g. the fact that smart people have been trying to build them since the dawn of AI and continuing through today. And if we get such AIs, and they’re very smart and competent, that has relevant consequences similar to those of ‘rigid utility maximizing consequentialists’, particularly power-seeking / instrumental convergence, and not pursuing plans that have obvious and effective countermeasures.”
Do you buy that argument? If so, I think some discussions of “rigid utility maximizing consequentialists” can be useful. I also think that some such discussions can lead to conclusions that do not necessarily transfer to more realistic AGIs (see here). So again, I think we should avoid yay-versus-boo thinking.
“The Machine Intelligence Research Institute (MIRI), which was at the forefront of theoretical AI safety research during this period, has since admitted that its efforts have utterly failed.”
I think that part of the blog post you linked was being facetious. IIUC they had some undisclosed research program involving Haskell for a few years, and then they partly but not entirely wound it down when it wasn’t going as well as they had hoped. But they have also been doing other things the whole time, like their agent foundations team. (I have no personal knowledge beyond reading the newsletters etc.)
For example, FWIW, I have personally found MIRI employee Abram Demski’s blog posts (including pre-2020) to be very helpful to my thinking about AGI alignment.
Anyway, your more general claim in this section seems to be: Given current levels of capabilities, there is no more alignment research to be done. We’re tapped out. The well is dry. The only possible thing left to do is twiddle our thumbs and wait for more capable models to come out.
Is that really your belief? Do you look at literally everything on alignmentforum etc. as total garbage? Obviously I have a COI but I happen to think there is lots of alignment work yet to do that would be helpful and does not need newly-advanced capabilities to happen.
Nothing in this comment should be construed as “all things considered we should be for or against the pause”—as it happens I’m weakly against the pause too—these are narrower points. :)
I certainly give relatively little weight to most conceptual AI research. That said, I respect that it’s valuable for you and am open to trying to narrow the gap between our views here—I’m just not sure how!
To be more concrete, I’d value 1 year of current progress over 10 years of pre-2018 research (to pick a date relatively arbitrarily). I don’t intend this as an attack on the earlier alignment community; I just think we’re making empirical progress in a way that was pretty much impossible before we had good models available to study, and I place a lot more value on this.
I have a vague impression (I forget from where, and it may well be false) that Nora has read some of my AI alignment research, and that she thinks of it as not entirely pointless. If so, then when I say “pre-2020 MIRI (esp. Abram & Eliezer) deserve some share of the credit for my thinking”, that’s meaningful, because there is in fact some nonzero credit to be given. Conversely, if you (or anyone) don’t know anything about my AI alignment research, or think it’s dumb, then you should ignore that part of my comment: it’s not offering any evidence, and it would just be saying that useless research can sometimes lead to further useless research, which is obvious! :)
I probably think less of current “empirical” research than you, because I don’t think AGI will look and act and be built just like today’s LLMs but better / larger. I expect highly-alignment-relevant differences between here and there, including (among other things) reinforcement learning being involved in a much more central way than it is today (where its role is RLHF fine-tuning). This is a big topic where I think reasonable people disagree, and maybe this comment section isn’t a great place to hash it out. ¯\_(ツ)_/¯
My own research doesn’t involve LLMs and could have been done in 2017, but I’m not sure I would call it “purely conceptual”: it involves a lot of stuff like scrutinizing data tables in experimental neuroscience papers. The ELK research project led by Paul Christiano also could have been done in 2017, as far as I can tell, but lots of people seem to think it’s worthwhile; do you? (Paul is a coinventor of RLHF.)
I’ve certainly heard of your work but it’s far enough out of my research interests that I’ve never taken a particularly strong interest. Writing this in this context makes me realise I might have made a bit of a one-man echo chamber for myself… Do you mind if we leave this as ‘undecided’ for a while?
Regarding ELK: I think the core of the problem, as I understand it, is fairly clear once you begin thinking about interpretability. Understanding the relation between AI and human ontologies was part of the motivation behind my work on AlphaZero (as well as an interest in the natural abstractions hypothesis). Section 4 “Encoding of human conceptual knowledge” and Section 8 “Exploring activations with unsupervised methods” are the places to look. I think the section on challenges and limitations in concept probing echoes a lot of the concerns in ELK.
In terms of subsequent work on ELK, I don’t think much of the work on solving ELK was particularly useful, and it often reinvented existing methods (e.g. sparse probing, causal interchange interventions). If I were to try to work on it, I think the best way to do so would be to embed the core challenge in a tractable research program, for instance trying to extract new scientific knowledge from ML models like AlphaFold.
To move this in a more positive direction, the most fruitful/exciting conceptual work I’ve seen is probably (1) the natural abstractions hypothesis and (2) debate. When I think a bit about why I particularly like these, for (1) it’s because it seems plausibly true, extremely useful if true, and amenable to both formal theoretical work and empirical study. For (2) it’s because it’s a pretty striking new idea that seems very powerful/scalable, but also can be put into practice a bit ahead of really powerful systems.
It’s perhaps also worth separating the claims that A) previous alignment research was significantly less helpful than today’s research and B) the reason that was the case continues to hold today.
I think I’d agree with some version of A, but strongly disagree with B.
The reason that A seems probably true to me is that we didn’t know the basic paradigm in which AGI would arise, and so previous research was forced to wander in the dark. You might also believe that today’s focus on empirical research is better than yesterday’s focus on theoretical research (I don’t necessarily agree) or at least that theoretical research without empirical feedback is on thin ice (I agree).
I think most people now think that deep learning, perhaps with some modifications, will be what leads to AGI—some even think that LLM-like systems will be sufficient. And the shift from primarily theoretical research to primarily empirical research has already happened. So what will cause today’s research to be worse than future research with more capable models? You can appeal to a general principle of “unknown unknowns,” but if you genuinely believe that deep learning (or LLMs) will eventually be used in future AGI, it seems hard to believe that knowledge won’t transfer at all.
Steven, the issue is that without empirical data you end up with a branching tree of possible futures. And if you make some faulty assumptions early (such as assuming the amount of compute needed to host optimal AI models is small and easily stolen via hacking), you end up lost in a tree of possibilities where every one you consider is “doom”. And thus you arrive at the conclusion of “pDoom is 99 percent”, because you are only cognitively able to consider adjacent futures in the possibility tree. No living human can keep track of thousands of possibilities in parallel. This is where I think Eliezer and Zvi are lost: they simply ignore branches that would lead to different outcomes.
(And vice versa: you could arrive at the opposite conclusion.)
It becomes angels on the head of a pin. There is no way to make a policy decision based on this. You need to prove your beliefs with data. It’s how we even got here as a species.