One of the three major threads in this post (I think) is alignment optimism: AI safety probably isn’t super hard.
A possible implication is that a pause is unnecessary. But the difficulty of alignment doesn’t seem to imply much about whether slowing is good or bad, or about its priority relative to other goals.
(I disagree that gradient descent entails “we are the innate reward system,” and thus safety, or that “full read-write access to [AI systems’] internals” confers safety in the absence of strong interpretability. Likely failure modes, in my view, include the AI playing the training game, influence-seeking behavior coming to dominate, misalignment emerging as capabilities generalize, and catastrophic Goodharting; I think AGI Ruin: A List of Lethalities is largely right. But in this debate we should focus on determining optimal behavior as a function of alignment difficulty, rather than having intractable arguments about how difficult alignment is.)
Yes. This one seems critical, and I don’t understand it at all.
At the extremes: if alignment-to-“good”-values by default were 100% likely, I presume slowing down would be net-negative and racing ahead would look great. It’s unclear to me where the tipping point is: what distribution over alignment difficulty levels would one need to hold to flip from wanting to speed up to wanting to slow down AI progress?
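One way to make the tipping-point question concrete is a toy expected-value comparison. This is only a sketch under invented assumptions (the payoff of aligned AGI, the cost of misalignment, how much residual risk a slowdown removes, and the cost of delay are all illustrative numbers, not claims from the post):

```python
def ev_speed(p, win=1.0, doom=-10.0):
    # Racing ahead: alignment-by-default succeeds with probability p,
    # paying off `win`; otherwise catastrophe worth `doom`.
    return p * win + (1 - p) * doom

def ev_slow(p, risk_cut=0.2, delay_cost=0.1, win=1.0, doom=-10.0):
    # Slowing down: assume it removes a fixed fraction of the *remaining*
    # risk, at a flat cost of delay. All parameters are assumptions.
    q = p + risk_cut * (1 - p)
    return q * win + (1 - q) * doom - delay_cost

# Scan credences in alignment-by-default to find where racing starts to win.
tipping = next(p / 100 for p in range(101)
               if ev_speed(p / 100) >= ev_slow(p / 100))
print(f"Under these toy numbers, speeding up wins once p >= {tipping:.2f}")
# → Under these toy numbers, speeding up wins once p >= 0.96
```

With a large downside and a slowdown that meaningfully cuts residual risk, racing only wins at very high credence in alignment-by-default; the tipping point moves with every parameter, which is exactly why one’s full distribution over difficulty levels matters rather than a point estimate.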
It seems to me that the more longtermist one is, the better slowing down looks, even when one is very optimistic about alignment. Then again, some considerations push against this: the risk of totalitarianism, the risk of a pause that never ends, and the risk that value-agnostic alignment gets solved and the first AGI is aligned to “worse” values than the default outcome.
(I realize I’m using two different definitions of alignment in this comment, alignment to “good” values and value-agnostic alignment; I’d like to know whether there’s standardized terminology to distinguish them.)