A principle of “don’t kill more than X number of people” seems far simpler and more likely than one saying “don’t kill more than X number of people, unless you kill 8 billion people”.
I think the theme of what is easier to learn among these 2 came up during Ajeya Cotra’s recent appearance on The 80,000 Hours Podcast.
Rob Wiblin: Yeah. In the post you point out ways that imperfectly trying to address this issue [negatively rewarding bad behavior from the AI] could end up backfiring, or at least not solving the problem. I think that the basic idea is that if you already have kind of schemy or sycophancy tendencies, then during the training process the people will start getting a bit smarter at catching you out when you’re engaging in schemy behaviour or you’re being deceptive. Then there’s kind of two ways you could go: one way would be to learn, “Deception doesn’t pay. I’ve got to be a Saint”; the other would be, “I’ve got to be better at my lying. I’ve just learned that particular lying strategies don’t work, but I’m going to keep the other, smarter lying strategies.”
Ajeya Cotra: That’s right.
Rob Wiblin: How big a problem is this?
Ajeya Cotra: I think it is one of the biggest things I worry about.
If we were in a world where basically the AI systems could try sneaky deceptive things that weren’t totally catastrophic — didn’t go as far as taking over the world in one shot — and then if we caught them and basically corrected that in the most straightforward way, which is to give that behaviour a negative reward and try and find other cases where it did something similar and give that negative reward, and that just worked, then we would be in a much better place. Because it would mean we can kind of operate iteratively and empirically without having to think really hard about tricky corner cases.
If, in fact, what happens when you give this behaviour a negative reward is that the model just becomes more patient and more careful, then you’ll observe the same thing — which is that you stop seeing that behaviour — but it means a much scarier implication.
Rob Wiblin: Yeah, it feels like there’s something perverse about this argument, because it seems like it can’t be generally the case that giving negative reward to outcome X or process X then causes it to become extremely good at doing X in a way that you couldn’t pick up. Most of the time when you’re doing reinforcement learning, as you give it positive and negative reinforcement, it tends to get closer to doing the thing that you want. Do we have some reason to think that this is an exceptional case that violates that rule?
Ajeya Cotra: Well, one thing to note is that you do see more of what you want in this world. You’ll see perhaps this model that, instead of writing the code you wanted to write, it went and grabbed the unit tests you were using to test it on and just like special-cased those cases in its code, because that was easier. It does that on Wednesday and it gets a positive reward for it. And then on Thursday you notice the code totally doesn’t work and it just copied and pasted the unit tests. So you go and give it a negative reward instead of a positive reward. Then it does stop doing that — on Friday, it’ll probably just write the code like you asked and not bother doing the unit test thing.
This isn’t a matter of reinforcement learning not working as normal. I’m starting from the premise that it is working as normal, so all this stuff that you’re whacking is getting better. But then it’s a question of what does it mean? Like, how is it that it made a change that caused its behaviour to be better in this case? Is it that its motivation — the initial motivation that caused it to try and deceive you — is a robust thing, and it’s changing basically the time horizon on which it thinks? Is that an easier change to make? Or is it an easier change to make to change its motivation from tendency to be deceitful to tendency not to be deceitful? That’s just a question that people have different intuitions about.
Rob Wiblin: OK. So I guess it’s an empirical question, but as you were saying, people also have different intuitions about it. What do you think the question of which way it would go would hinge on? My intuition is that it’s related to what we were talking about earlier, about which mind is more complicated, in terms of if both of them would perform equally well on the subsequent test because it’s either gotten better at lying or it’s gotten more honest. And if both of those things are rewarded, then I suppose the question is which is more difficult to do. Is that it?
Ajeya Cotra: That’s definitely one piece of it. The other thing that pushes me to think it’ll be more in the direction of deceitful is that this is one thing you caught it doing. Suppose on Wednesday it did this unit-test-hacking thing, and it also fudged some financial logs to make it seem like it made more money than it actually did. On Thursday you actually only noticed the unit test thing; you still haven’t noticed the financial logs editing because that was more subtle. So on Thursday, what does it learn? It isn’t getting a uniform update against all the bad things it did. It got an update against this one bad thing it did, but it still got a positive reward. It was actively pushed up on this second bad thing it did.
Rob Wiblin: I see. Interesting. It might be a bit like if you catch more than half of its deceitful behaviour, then on average it’s getting the message that deceit is bad. If, on the other hand, you’re catching less than half, then it’s on balance learning that deceit pays in reward.
Ajeya Cotra: I think that is still too crude a way of thinking about it. That might be the way it goes, but actually there’s going to be a pattern to the ones you catch and the ones you don’t catch, right? It’s not just going to be a random half that you catch. Maybe if it were a random half or random 70% of deceitful behaviour you catch, then it would be learning that deceit on balance doesn’t pay.
In reality, I would guess it would be learning a messy combination of things that push in different directions. You’ll probably be catching the more obvious stuff that you check the next day, and you’ll be not catching the more subtle stuff that you’d have to do a deeper audit to catch. So you’re teaching it: “In general, humans will catch this type of stuff and they won’t catch that type of stuff.” Maybe you’re instilling a bit of a general aversion to deception, but also instilling a preference for the kinds of deception that were rewarded instead of punished.
Great post, titotal!
I think the theme of what is easier to learn among these 2 came up during Ajeya Cotra’s recent appearance on The 80,000 Hours Podcast.