Unless we know that alignment is going to be easier, pushing forward on capabilities without an outsized alignment benefit seems needlessly risky.
I am not disputing this :) I am just disputing the factual claim that we know which is easier.
I’d say “alignment is harder than capabilities” seems almost certainly true
Are you making the claim that we’re almost certainly not in a world where alignment is easy? (E.g. only requires something like Debate/IA and maybe some rudimentary interpretability techniques.) I don’t see how you could know that.
I’m not sure if I’m claiming quite that, but maybe I am. It depends on operationalizations.
Most importantly, I want to flag that even the people who are optimistic that “alignment might turn out to be easy” probably lose their optimism if we assume that timelines are sufficiently short. Like, would you/they still be optimistic if we for sure had <2 years? It seems to me that more people are confident that AI timelines are very short than are confident that we’ll solve alignment really soon. In fact, no one seems confident that we’ll solve alignment really soon. So, the situation already feels asymmetric.
On assessing alignment difficulty, I sympathize most with Eliezer’s claims that it’s important to get things right on the first try and that engineering progress among humans has almost never turned out smoother than initially expected (which, combined with the “we need to get it right on the first try” argument, is a reason for pessimism). I’m less sure how much I buy Eliezer’s confidence that “niceness/helpfulness” isn’t easy to train/isn’t a basin of attraction. He has some story about how prosocial instincts evolved in humans for super-contingent reasons, so that it’s highly unlikely to replay in ML training. And there I’m more like “Hm, hard to know.” So, I’m not pessimistic for inherent technical reasons. It’s more that I’m pessimistic because I think we’ll fumble the ball even if we’re in the lucky world where the technical stuff is surprisingly easy.
That said, I still think “alignment difficulty?” isn’t the sort of question where the ignorance prior is 50-50. It feels like there are more possibilities for it to be hard than easy.