So there’s this core question: “how are the results of this project going to help with the superintelligence alignment problem?” My claim can be broken down as follows:
“The problem is relevant”: There’s a part of the superintelligence alignment problem that is analogous to this problem. I think the problem is relevant for reasons I already tried to spell out here.
“The solution is relevant”: There’s something helpful about getting better at solving this problem. This is what I think you’re asking about, and I haven’t talked as much about why I think the solution is relevant, so I’ll do that here.
I don’t think the process we develop will generalize, in the sense that I don’t think we’ll be able to apply it directly to the problems we ultimately care about, but I think it’s still likely to be a useful step.
There are more advanced techniques that have been proposed for ensuring models don’t do bad things: for example, relaxed adversarial training, or adversarial training where the humans have access to powerful tools that help them find inputs on which the model does bad things (e.g. as in proposal 2 here). But it seems easier to research those things once we’ve done this research, for a few reasons:
It’s nice to have baselines. In general, in ML, if you’re trying to develop a new technique that you think gets around fundamental weaknesses of an existing one, it’s important to start by getting a clear picture of how good the existing techniques are. ML research often has a problem where people publish papers claiming some technique beats the existing one, and then it turns out the existing technique is just as good if you use it properly (which, of course, the researchers are incentivized not to do). That makes it harder to understand where your improvements are actually coming from. So it seems good to try pretty hard to apply the naive adversarial training scheme before moving on to more complicated things.
There are some shared subproblems between the techniques we’re using and the more advanced ones. For example, some advanced techniques involve building powerful ML-based tools that help humans generate adversarial examples, and there’s a fairly smooth continuum between what we’re trying now and those tool-assisted versions (see the sketch after this list). So many of the practical details we’re sorting out in our current work will make it easier to test these more advanced techniques later, if we want to.
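To make that continuum concrete, here is a minimal sketch of the kind of loop involved. It’s a toy illustration, not our actual code: the function names (`propose_inputs`, `model_generates_bad_output`, `finetune_on_counterexamples`) are hypothetical stand-ins for whatever machinery fills those roles, and the only thing that changes as you move along the continuum is how `propose_inputs` is implemented (unaided humans at one end, humans with powerful ML-based tools at the other).

```python
from typing import Callable, List


def adversarial_training_round(
    propose_inputs: Callable[[], List[str]],            # unaided humans, or humans using ML tools
    model_generates_bad_output: Callable[[str], bool],  # human judgment of the model's behavior
    finetune_on_counterexamples: Callable[[List[str]], None],
) -> int:
    """One round of (naive or tool-assisted) adversarial training.

    Collect inputs on which the model does something bad, then fine-tune
    the model against those counterexamples. Returns how many were found.
    """
    counterexamples = [x for x in propose_inputs() if model_generates_bad_output(x)]
    if counterexamples:
        finetune_on_counterexamples(counterexamples)
    return len(counterexamples)
```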
I often think of our project as being kind of analogous to Learning to summarize with human feedback. That paper isn’t claiming that knowing how to train models by getting humans to choose which of two options they prefer solves the whole alignment problem. But it was still probably helpful to have sorted out some of the basic questions about how to do training from human feedback before moving on to more advanced techniques (like training from human feedback where the humans have access to ML tools that help them provide better feedback).
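For reference, the core of that paper’s reward-model training is just a cross-entropy loss on pairwise human comparisons; here’s a minimal sketch in PyTorch. The shapes and names are illustrative assumptions, not taken from the paper’s actual code.

```python
import torch
import torch.nn.functional as F


def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise-comparison loss for a reward model trained on human preferences.

    reward_chosen / reward_rejected: shape (batch,), scalar reward-model scores for
    the option the human preferred and the one they didn't.
    Minimizing this pushes the reward model to score preferred outputs higher.
    """
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()
```

Most of the value there was presumably in the practical machinery around this loss (collecting comparisons, training the reward model reliably, using it for RL) rather than the formula itself, which is roughly the sense in which sorting out the basic questions first seems worthwhile.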