I take this post to argue that, just as an AGI’s alignment property won’t generalise well out-of-distribution, its ability to actually do things, i.e. achieve its goals, also won’t generalise well out-of-distribution. Does that seem like a fair (if brief) summary?
As an aside, I feel like it’s more fruitful to talk about specific classes of defects rather than all of them together. You use the word “bug” to mean everything from divide by zero crashes to wrong beliefs, which leads you to write things like “the inherent bugginess of AI is a very good thing for AI safety”. But the entire field of AI safety seems to exist precisely because AIs will have bugs (i.e. deviations from desired/correct behaviour), so if anything an inherent lack of bugs in AI would be better for AI safety.
Yes, that’s a fair summary. I think that perfect alignment is pretty much impossible, as is perfectly rational/bug-free AI. I think the latter fact may give us enough breathing room to get alignment at least good enough to avert extinction.
I feel like it’s more fruitful to talk about specific classes of defects rather than all of them together. You use the word “bug” to mean everything from divide by zero crashes to wrong beliefs
That’s fair; I think if people were to explore this topic further, it would make sense to separate them out. And good point about the bugginess passage; I’ve edited it to be more accurate.