b) Same for “the capability program is an easier technical problem than the alignment program”. You don’t know that; nobody knows that; Lord Kelvin/Einstein/Ehrlich/etc would all have said “X is an easier technical problem than flight/nuclear energy/feeding the world/etc” for a wide range of X, a few years before each of those actually happened.
Even if we should be undecided here, there’s an asymmetry where, if you get alignment too early, that’s okay, but getting capabilities before alignment is bad. Unless we know that alignment is going to be easier, pushing forward on capabilities without an outsized alignment benefit seems needlessly risky.
On the object level, if we think the scaling hypothesis is roughly correct (or “close enough”) or if we consider it telling that evolution probably didn’t have the sophistication to install much specialized brain circuitry between humans and other great apes, then it seems like getting capabilities past some universality and self-improvement/self-rearrangement (“learning how to become better at learning/learning how to become better at thinking”) threshold cannot be that difficult? Especially considering that we arguably already have “weak AGI.” (But maybe you have an inside view that says we still have huge capability obstacles to overcome?)
At the same time, alignment research seems to be in a fairly underdeveloped state (at least my impression as a curious outsider), so I’d say “alignment is harder than capabilities” seems almost certainly true. Factoring in lots of caveats about how they aren’t always cleanly separable, and so on, doesn’t seem to change that.
Unless we know that alignment is going to be easier, pushing forward on capabilities without an outsized alignment benefit seems needlessly risky.
I am not disputing this :) I am just disputing the factual claim that we know which is easier.
I’d say “alignment is harder than capabilities” seems almost certainly true
Are you making the claim that we’re almost certainly not in a world where alignment is easy? (E.g. only requires something like Debate/IA and maybe some rudimentary interpretability techniques.) I don’t see how you could know that.
Are you making the claim that we’re almost certainly not in a world where alignment is easy? (E.g. only requires something like Debate/IA and maybe some rudimentary interpretability techniques.) I don’t see how you could know that.
I’m not sure if I’m claiming quite that, but maybe I am. It depends on operationalizations.
Most importantly, I want to flag that even the people who are optimistic about “alignment might turn out to be easy” probably lose their optimism if we assume that timelines are sufficiently short. Like, would you/they still be optimistic if we knew for sure that we had <2 years? It seems to me that more people are confident that AI timelines are very short than are confident that we’ll solve alignment really soon. In fact, no one seems confident that we’ll solve alignment really soon. So, the situation already feels asymmetric.
On assessing alignment difficulty, I sympathize most with Eliezer’s claims that it’s important to get things right on the first try and that human engineering progress has almost never turned out to be smoother than initially expected (which, combined with the “we need to get it right on the first try” argument, is a reason for pessimism). I’m less sure how much I buy Eliezer’s confidence that “niceness/helpfulness” isn’t easy to train/isn’t a basin of attraction. He has some story about how prosocial instincts evolved in humans for super-contingent reasons, so that the same thing is highly unlikely to replay in ML training. And there I’m more like “Hm, hard to know.” So, I’m not pessimistic for inherent technical reasons. It’s more that I’m pessimistic because I think we’ll fumble the ball even if we’re in the lucky world where the technical stuff is surprisingly easy.
That said, I still think “alignment difficulty?” isn’t the sort of question where the ignorance prior is 50-50. It feels like there are more possibilities for it to be hard than easy.