There are two concurrent research programs, and if one program (capability) completes before the other (alignment), we all die; but the capability program is an easier technical problem than the alignment program. Do you disagree with that framing?
Yepp, I disagree on a bunch of counts.
a) I dislike the phrase “we all die”: nobody has justifiable confidence high enough to make that claim. Even if ASI is misaligned enough to seize power, there’s a pretty wide range of options for the future of humans, including some really good ones (just as there’s a pretty wide range of options for the future of gorillas, if humans remain in charge).
b) Same for “the capability program is an easier technical problem than the alignment program”. You don’t know that; nobody knows that; Lord Kelvin/Einstein/Ehrlich/etc would all have said “X is an easier technical problem than flight/nuclear energy/feeding the world/etc” for a wide range of X, a few years before each of those actually happened.
c) The distinction between capabilities and alignment is a useful concept when choosing research on an individual level; but it’s far from robust enough to be a good organizing principle on a societal level. There is a lot of disagreement about what qualifies as which, and to what extent, even within the safety community; I think there are a whole bunch of predictable failure modes of the political position that “here is the bad thing that must be prevented at all costs, and here is the good thing we’re crucially trying to promote, and also everyone disagrees on where the line between them is and they’re done by many of the same people”. This feels like a recipe for unproductive or counterproductive advocacy, corrupt institutions, etc. If alignment researchers had to demonstrate that their work had no capabilities externalities, they’d never get anything done (just as, if renewables researchers had to demonstrate that their research didn’t involve emitting any carbon, they’d never get anything done). I will write about possible alternative framings in an upcoming post.
I’m guessing you would oppose a worldwide ban starting today on all “experimental” AI research (i.e., all use of computing resources to run AIs) until the scholars of the world settle on how to keep an AI aligned through the transition to superintelligence.
As written, I would oppose this. I doubt the world as a whole could solve alignment with zero AI experiments; it feels like asking medieval theologians to figure out the correct theory of physics without ever doing experiments.
b) Same for “the capability program is an easier technical problem than the alignment program”. You don’t know that; nobody knows that; Lord Kelvin/Einstein/Ehrlich/etc would all have said “X is an easier technical problem than flight/nuclear energy/feeding the world/etc” for a wide range of X, a few years before each of those actually happened.
Even if we should be undecided here, there’s an asymmetry: getting alignment too early is okay, whereas getting capabilities before alignment is bad. Unless we know that alignment is going to be easier, pushing forward on capabilities without an outsized alignment benefit seems needlessly risky.
On the object level: if we think the scaling hypothesis is roughly correct (or “close enough”), or if we consider it telling that evolution probably didn’t have the sophistication to install much specialized brain circuitry between humans and other great apes, then it seems like getting capabilities past some threshold of universality and self-improvement/self-rearrangement (“learning how to become better at learning/learning how to become better at thinking”) cannot be that difficult? Especially considering that we arguably already have “weak AGI.” (But maybe you have an inside view that says we still have huge capability obstacles to overcome?)
At the same time, alignment research seems to be in a fairly underdeveloped state (at least that’s my impression as a curious outsider), so I’d say “alignment is harder than capabilities” seems almost certainly true. Factoring in lots of caveats about how the two aren’t always cleanly separable, and so on, doesn’t seem to change that.
Unless we know that alignment is going to be easier, pushing forward on capabilities without an outsized alignment benefit seems needlessly risky.
I am not disputing this :) I am just disputing the factual claim that we know which is easier.
I’d say “alignment is harder than capabilities” seems almost certainly true
Are you making the claim that we’re almost certainly not in a world where alignment is easy? (E.g. only requires something like Debate/IA and maybe some rudimentary interpretability techniques.) I don’t see how you could know that.
Are you making the claim that we’re almost certainly not in a world where alignment is easy? (E.g. only requires something like Debate/IA and maybe some rudimentary interpretability techniques.) I don’t see how you could know that.
I’m not sure if I’m claiming quite that, but maybe I am. It depends on operationalizations.
Most importantly, I want to flag that even the people who are optimistic about “alignment might turn out to be easy” probably lose their optimism if we assume that timelines are sufficiently short. Like, would you/they still be optimistic if we for sure had <2 years? It seems to me that more people are confident that AI timelines are very short than are confident that we’ll solve alignment really soon. In fact, no one seems confident that we’ll solve alignment really soon. So, the situation already feels asymmetric.
On assessing alignment difficulty, I sympathize most with Eliezer’s claims that it’s important to get things right on the first try and that engineering progress among humans has almost never turned out to be smoother than initially expected (which, combined with the “we need to get it right on the first try” argument, is a reason for pessimism). I’m less sure how much I buy Eliezer’s confidence that “niceness/helpfulness” isn’t easy to train/isn’t a basin of attraction. He has some story about how prosocial instincts evolved in humans for super-contingent reasons, so that they’re highly unlikely to replay in ML training. And there I’m more like “Hm, hard to know.” So, I’m not pessimistic for inherent technical reasons. It’s more that I’m pessimistic because I think we’ll fumble the ball even if we’re in the lucky world where the technical stuff is surprisingly easy.
That said, I still think “how difficult is alignment?” isn’t the sort of question where the ignorance prior is 50-50. It feels like there are more ways for it to be hard than easy.
the capability program is an easier technical problem than the alignment program.
You don’t know that; nobody knows that
Do you concede that frontier AI research is intrinsically dangerous?
That it is among the handful of the most dangerous research programs ever pursued by our civilization?
If not, I hope you can see why those who do consider it intrinsically dangerous are not particularly mollified or reassured by “well, who knows? maybe it will turn out OK in the end!”
The distinction between capabilities and alignment is a useful concept when choosing research on an individual level; but it’s far from robust enough to be a good organizing principle on a societal level.
When I wrote “the alignment program” above, I meant something specific, which I believe you will agree is robust enough to organize society around (if only we could get society to go along with it): namely, thinking hard together about alignment, without doing anything dangerous like training up models with billions of parameters, until we have at least a rough design that most professional researchers agree is more likely to help us than to kill us even if it turns out to have super-human capabilities, even if settling on that design takes us many decades. E.g., what MIRI has been doing for the last 20 years.
I dislike the phrase “we all die”: nobody has justifiable confidence high enough to make that claim. Even if ASI is misaligned enough to seize power, there’s a pretty wide range of options for the future of humans
It makes me sad that you do not see that “we all die” is the default outcome unless a lot of correct optimization pressure is applied by researchers to the design of the first sufficiently-capable AI before that AI is given computing resources. It would have been nice to have someone with your capacity for clear thinking working on the problem. Are you sure you’re not overly attached (e.g., for intrapersonal motivational reasons) to an optimistic vision in which AI research “feels like the early days of hacker culture” and “there are hackathons where people build fun demos”?