I put down this book agreeing that we need to control AI (and indeed we can, according to Russell, with good engineering). But if intelligence is intelligence is intelligence, then mustn’t we also turn to humans and constrain them in the same way, so that they don’t pursue ‘goals inside the human’ that are significantly at odds with ‘our’ preferences?
Elsewhere we sometimes call this the “human alignment problem” and use it as a test case, in the sense that if we can’t design a mechanism robust enough to solve human alignment, we probably can’t use it to solve AI alignment, because AIs (especially superhuman AIs) are much better optimizers than humans. Some might argue against this, pointing out that humans are fallible in ways that machines are not, but the point is that if you can’t make safe something as bad at optimizing as humans, who for a wide variety of reasons often look like they’re just taking random walks, you can’t hope to make safe something that is reliably good at achieving its goals.
But we can decide what goes inside the machine, whereas with people we can only control outside circumstances. It seems to me that such a mechanism would very likely be an internal one, and so wouldn’t be applicable to people.
We’re in an analogous situation with AI. AI is too complex for us to fully understand what it does (by design), and the same is true of mundane, human-programmed software (ask any software engineer who has worked on anything more than 1k lines long whether their program ever did something unexpected, and I can promise you the answer is “yes”). So although we in theory control what goes on inside an AI, in practice we have far less control than it first seems; so much so that we often have better models of how humans decide to do things than we do of how AIs do.
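To make the “unexpected behavior” point concrete, here’s a minimal Python sketch (my own toy example, nothing from the book): a three-line function whose author controls every character of its source and still gets behavior they almost certainly didn’t intend.

```python
# Classic Python gotcha: the default list is evaluated once, when the
# function is defined, so every call that omits `items` shares one list.
def append_item(item, items=[]):
    items.append(item)
    return items

print(append_item(1))  # [1]    -- what the author expected
print(append_item(2))  # [1, 2] -- surprise: state leaked across calls
```

If we can’t reliably predict three lines we wrote ourselves, “we decide what goes inside the machine” carries much less weight at the scale of a modern AI system.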
Great additional detail, thanks!