Back in July, you held an in-person Q&A at REACH and said “There are a bunch of things about AI alignment which I think are pretty important but which aren’t written up online very well. One thing I hope to do at this Q&A is try saying these things to people and see whether people think they make sense.” Could you say more about what these important things are, and what was discussed at the Q&A?
I don’t really remember what was discussed at the Q&A, but I can try to name important things about AI safety which I think aren’t as well known as they should be. Here are some:
----
I think the ideas described in the paper Risks from Learned Optimization are extremely important; they’re less underrated now that the paper has been released, but I still wish that more people who are interested in AI safety understood those ideas better. In particular, the distinction between inner and outer alignment makes my concerns about aligning powerful ML systems much crisper.
----
On a meta note: Different people who work on AI alignment have radically different pictures of what the development of AI will look like, what the alignment problem is, and what solutions might look like.
----
Compared to people who are relatively new to the field, skilled and experienced AI safety researchers seem to have a much more holistic and much more concrete mindset when they’re talking about plans to align AGI.
For example, here are some of my beliefs about AI alignment (none of which are original ideas of mine):
--
I think it’s pretty plausible that meta-learning systems are going to be a bunch more powerful than non-meta-learning systems at tasks like solving math problems. I’m concerned that by default meta-learning systems are going to exhibit alignment problems, for example deceptive misalignment. You could solve this with some combination of adversarial training and transparency techniques. In particular, I think that to avoid deceptive misalignment you could use a combination of the following components:
Some restriction of what ML techniques you use
Some kind of regularization of your model to push it towards increased transparency
Neural net interpretability techniques
Some adversarial setup, where you’re using your system to answer questions about whether there exist questions that would cause it to behave unacceptably.
Each of these components can be stronger or weaker, where by stronger I mean “more restrictive but having more nice properties”.
The stronger you can build one of those components, the weaker the others can be. For example, if you have some kind of regularization that you can do to increase transparency, you don’t have to have neural net interpretability techniques that are as powerful. And if you have a more powerful and reliable adversarial setup, you don’t need to have as much restriction on what ML techniques you can use.
And I think you can get the adversarial setup to be powerful enough to catch non-deceptive mesa optimizer misalignment, but I don’t think you can prevent deceptive misalignment without having powerful enough interpretability techniques that you can get around things like the RSA 2048 problem.
--
In the above arguments, I’m looking at the space of possible solutions to a problem and trying to narrow the possibility space, by spotting better solutions to subproblems or reducing subproblems to one another, and by arguing that it’s impossible to come up with a solution of a particular type.
The key thing that I didn’t use to do is thinking of the alignment problem as having components which can be attacked separately, and thinking of solutions to subproblems as being comprised of some combination of technologies which can be thought about independently. I used to think of AI alignment as being more about looking for a single overall story for everything, as opposed to looking for a combination of technologies which together allow you to build an aligned AGI.
On a meta note: Different people who work on AI alignment have radically different pictures of what the development of AI will look like, what the alignment problem is, and what solutions might look like.
+1, this is the thing that surprised me most when I got into the field. I think helping increase common knowledge and agreement on the big picture of safety should be a major priority for people in the field (and it’s something I’m putting a lot of effort into, so send me an email at richardcngo@gmail.com if you want to discuss this).
Back in July, you held an in-person Q&A at REACH and said “There are a bunch of things about AI alignment which I think are pretty important but which aren’t written up online very well. One thing I hope to do at this Q&A is try saying these things to people and see whether people think they make sense.” Could you say more about what these important things are, and what was discussed at the Q&A?
I don’t really remember what was discussed at the Q&A, but I can try to name important things about AI safety which I think aren’t as well known as they should be. Here are some:
----
I think the ideas described in the paper Risks from Learned Optimization are extremely important; they’re less underrated now that the paper has been released, but I still wish that more people who are interested in AI safety understood those ideas better. In particular, the distinction between inner and outer alignment makes my concerns about aligning powerful ML systems much crisper.
----
On a meta note: Different people who work on AI alignment have radically different pictures of what the development of AI will look like, what the alignment problem is, and what solutions might look like.
----
Compared to people who are relatively new to the field, skilled and experienced AI safety researchers seem to have a much more holistic and much more concrete mindset when they’re talking about plans to align AGI.
For example, here are some of my beliefs about AI alignment (none of which are original ideas of mine):
--
I think it’s pretty plausible that meta-learning systems are going to be a bunch more powerful than non-meta-learning systems at tasks like solving math problems. I’m concerned that by default meta-learning systems are going to exhibit alignment problems, for example deceptive misalignment. You could solve this with some combination of adversarial training and transparency techniques. In particular, I think that to avoid deceptive misalignment you could use a combination of the following components:
Some restriction of what ML techniques you use
Some kind of regularization of your model to push it towards increased transparency
Neural net interpretability techniques
Some adversarial setup, where you’re using your system to answer questions about whether there exist questions that would cause it to behave unacceptably.
Each of these components can be stronger or weaker, where by stronger I mean “more restrictive but having more nice properties”.
The stronger you can build one of those components, the weaker the others can be. For example, if you have some kind of regularization that you can do to increase transparency, you don’t have to have neural net interpretability techniques that are as powerful. And if you have a more powerful and reliable adversarial setup, you don’t need to have as much restriction on what ML techniques you can use.
And I think you can get the adversarial setup to be powerful enough to catch non-deceptive mesa optimizer misalignment, but I don’t think you can prevent deceptive misalignment without having powerful enough interpretability techniques that you can get around things like the RSA 2048 problem.
--
In the above arguments, I’m looking at the space of possible solutions to a problem and trying to narrow the possibility space, by spotting better solutions to subproblems or reducing subproblems to one another, and by arguing that it’s impossible to come up with a solution of a particular type.
The key thing that I didn’t use to do is thinking of the alignment problem as having components which can be attacked separately, and thinking of solutions to subproblems as being comprised of some combination of technologies which can be thought about independently. I used to think of AI alignment as being more about looking for a single overall story for everything, as opposed to looking for a combination of technologies which together allow you to build an aligned AGI.
You can see examples of this style of reasoning in Eliezer’s objections to capability amplification, or Paul on worst-case guarantees, or many other places.
+1, this is the thing that surprised me most when I got into the field. I think helping increase common knowledge and agreement on the big picture of safety should be a major priority for people in the field (and it’s something I’m putting a lot of effort into, so send me an email at richardcngo@gmail.com if you want to discuss this).
Also +1 on this.