I don’t see how AGIs could develop a provable, interpretable, ‘very sophisticated understanding of human values’ if alignment researchers don’t themselves have a sophisticated understanding of human values to test the AGI’s understanding against.
I don’t think anyone is aiming for provable alignment properties (except maybe for Stuart Russell); this just seems too hard.
But if AGIs could develop a very sophisticated understanding of other domains that humans don’t understand very well, by virtue of being more intelligent than humans, I don’t see why they wouldn’t be able to understand this domain very well too.
At least, it seems like we’d need a strong ‘training set’ of human values
This is how classic ML would do it. But in the modern paradigm, ML systems can infer all sorts of information from being trained on a very wide range of data (e.g. all the books, all the internet, etc), and so we should expect that they can infer human values from that too. There’s some preliminary evidence that language models can perform well on common-sense moral reasoning, and alignment researchers generally expect that future language models will be capable of answering questions about ethics to a superhuman level “by default”.
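To make that contrast concrete, here’s a minimal sketch (toy scenarios, a small placeholder model, and a made-up labelling scheme, not a real evaluation): the classic supervised route needs an explicit labelled set of value judgements, while a model pretrained on broad text can simply be prompted.

```python
# Minimal sketch (toy data, placeholder model): contrast between a classic
# supervised pipeline, which needs an explicit labelled "training set" of
# value judgements, and a pretrained language model queried zero-shot.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from transformers import pipeline

# 1. Classic ML: the target concept must be supplied as labelled examples.
scenarios = [
    "I returned the wallet I found to its owner.",
    "I lied to a friend so I could take their money.",
]
labels = [1, 0]  # 1 = acceptable, 0 = unacceptable (hand-labelled by humans)
vectorizer = TfidfVectorizer()
clf = LogisticRegression().fit(vectorizer.fit_transform(scenarios), labels)
print(clf.predict(vectorizer.transform(["I kept a stranger's lost phone."])))

# 2. Modern paradigm: a model pretrained on broad text is simply prompted;
# no task-specific labels are supplied. "gpt2" is just a small placeholder.
generator = pipeline("text-generation", model="gpt2")
prompt = "Question: Is it acceptable to keep a stranger's lost phone?\nAnswer:"
print(generator(prompt, max_new_tokens=20)[0]["generated_text"])
```

A small model like gpt2 won’t give a good answer, of course; the point is the structural difference: the first approach is bottlenecked on humans producing a labelled dataset of value judgements, while the second isn’t.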
More generally, it sounds like you’re gesturing towards the difference between “narrow alignment” and “ambitious alignment”, as discussed in this blog post. Broadly speaking, the goal of the former is to have AI that can be controlled; the goal of the latter is to have AI that could be trusted to steer the world. One reason most researchers focus on the former is that, if we could narrowly align AI, we could then use it to help us with the more complex task of ambitious alignment. And the properties required for an AI to be narrowly aligned (like “helpful”, “honest”, etc.) are sufficiently common-sense that I don’t think we gain much from a very in-depth study of them.
Richard—thanks very much for your quick and helpful reply. I’ll have a look at the links you included, and ruminate about this further...