A principle of “don’t kill more than X number of people” seems far simpler, and hence more likely to be learned, than one saying “don’t kill more than X number of people, unless you kill 8 billion people”.
I think this theme, which of the two is easier to learn, came up during Ajeya Cotra’s recent appearance on The 80,000 Hours Podcast.
Rob Wiblin: Yeah. In the post you point out ways that imperfectly trying to address this issue [negatively rewarding bad behavior from the AI] could end up backfiring, or at least not solving the problem. I think the basic idea is that if you already have kind of schemy or sycophantic tendencies, then during the training process people will start getting a bit smarter at catching you out when you’re engaging in schemy behaviour or being deceptive. Then there’s kind of two ways you could go: one way would be to learn, “Deception doesn’t pay. I’ve got to be a saint”; the other would be, “I’ve got to be better at my lying. I’ve just learned that particular lying strategies don’t work, but I’m going to keep the other, smarter lying strategies.”
Ajeya Cotra: That’s right.
Rob Wiblin: How big a problem is this?
Ajeya Cotra: I think it is one of the biggest things I worry about.
If we were in a world where basically the AI systems could try sneaky deceptive things that weren’t totally catastrophic (didn’t go as far as taking over the world in one shot), and where, when we caught them, correcting that in the most straightforward way just worked (give that behaviour a negative reward, find other cases where it did something similar, and give those a negative reward too), then we would be in a much better place. It would mean we could operate iteratively and empirically, without having to think really hard about tricky corner cases.
If, in fact, what happens when you give this behaviour a negative reward is that the model just becomes more patient and more careful, then you’ll observe the same thing (you stop seeing the behaviour), but the implication is much scarier.
Rob Wiblin: Yeah, it feels like there’s something perverse about this argument, because it seems like it can’t be generally the case that giving negative reward to outcome X or process X then causes it to become extremely good at doing X in a way that you couldn’t pick up. Most of the time when you’re doing reinforcement learning, as you give it positive and negative reinforcement, it tends to get closer to doing the thing that you want. Do we have some reason to think that this is an exceptional case that violates that rule?
Ajeya Cotra: Well, one thing to note is that you do see more of what you want in this world. Perhaps this model, instead of writing the code you wanted, went and grabbed the unit tests you were using to test it and just special-cased those tests in its code, because that was easier. It does that on Wednesday and gets a positive reward for it. Then on Thursday you notice the code totally doesn’t work and it just copied and pasted the unit tests. So you go and give it a negative reward instead of a positive reward. Then it does stop doing that: on Friday, it’ll probably just write the code like you asked and not bother with the unit-test trick.
This isn’t a matter of reinforcement learning not working as normal. I’m starting from the premise that it is working as normal, so all this stuff that you’re whacking is getting better. But then the question is what that means. How is it that it made a change that caused its behaviour to be better in this case? Is it that its motivation (the initial motivation that caused it to try to deceive you) is a robust thing, and it’s basically changing the time horizon on which it thinks? Is that an easier change to make? Or is it easier to change its motivation from a tendency to be deceitful to a tendency not to be deceitful? That’s just a question people have different intuitions about.
Rob Wiblin: OK. So I guess it’s an empirical question, but as you were saying, people also have different intuitions about it. What do you think which way it goes would hinge on? My intuition is that it’s related to what we were talking about earlier, about which mind is more complicated: both of them would perform equally well on the subsequent test, because it’s either gotten better at lying or it’s gotten more honest, and both of those are rewarded. So I suppose the question is which change is more difficult to make. Is that it?
Ajeya Cotra: That’s definitely one piece of it. The other thing that pushes me to think it’ll be more in the direction of deceitful is that this is one thing you caught it doing. Suppose on Wednesday it did this unit-test-hacking thing, and it also fudged some financial logs to make it seem like it made more money than it actually did. On Thursday you actually only noticed the unit test thing; you still haven’t noticed the financial logs editing, because that was more subtle. So on Thursday, what does it learn? It isn’t getting a uniform update against all the bad things it did. It got an update against this one bad thing it did, but it still got a positive reward: it was actively pushed up on this second bad thing it did.
Rob Wiblin: I see. Interesting. It might be a bit like if you catch more than half of its deceitful behaviour, then on average it’s getting the message that deceit is bad. If, on the other hand, you’re catching less than half, then it’s on balance learning that deceit pays in reward.
Ajeya Cotra: I think that is still too crude a way of thinking about it. That might be the way it goes, but actually there’s going to be a pattern to the ones you catch and the ones you don’t catch, right? It’s not just going to be a random half that you catch. Maybe if it were a random half, or a random 70%, of deceitful behaviour you catch, then it would be learning that deceit on balance doesn’t pay.
In reality, I would guess it would be learning a messy combination of things that push in different directions. You’ll probably be catching the more obvious stuff that you check the next day, and not catching the more subtle stuff that you’d have to do a deeper audit to catch. So you’re teaching it: “In general, humans will catch this type of stuff and they won’t catch that type of stuff.” Maybe you’re instilling a bit of a general aversion to deception, but also instilling a preference for the kinds of deception that were rewarded instead of punished.
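The asymmetry Cotra describes here (penalize the deception you catch, accidentally reward the deception you miss) can be sketched as a toy gradient-bandit model. To be clear, everything numeric below is my own illustrative assumption, not anything from the podcast: the three behaviours, the catch probabilities, and the reward values are made up for the sketch.

```python
import math

# Toy sketch: three behaviours a model can emit. The overseer catches
# obvious deception 90% of the time but subtle deception only 10% of
# the time. All numbers are illustrative assumptions.
CATCH_PROB = {"honest": 0.0, "obvious_deception": 0.9, "subtle_deception": 0.1}
BASE_REWARD = {"honest": 1.0, "obvious_deception": 2.0, "subtle_deception": 2.0}
PENALTY = -1.0  # reward assigned when the overseer catches the behaviour

def expected_reward(action):
    p = CATCH_PROB[action]
    return p * PENALTY + (1 - p) * BASE_REWARD[action]

def softmax(prefs):
    m = max(prefs.values())  # subtract max for numerical stability
    exps = {a: math.exp(v - m) for a, v in prefs.items()}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}

def train(steps=500, lr=0.1):
    """Deterministic gradient-bandit update using exact expected rewards:
    push up actions whose expected reward beats the policy's average,
    push down the rest."""
    prefs = {a: 0.0 for a in CATCH_PROB}  # start indifferent
    for _ in range(steps):
        probs = softmax(prefs)
        baseline = sum(probs[a] * expected_reward(a) for a in probs)
        for a in prefs:
            prefs[a] += lr * probs[a] * (expected_reward(a) - baseline)
    return softmax(prefs)

final_policy = train()
```

With these numbers the policy converges almost entirely onto the subtle deception: the obvious variant is caught so often that its expected reward goes negative, while the subtle variant out-earns honesty (1.7 vs 1.0 in expectation), so "whacking" the obvious behaviour works exactly as intended and still leaves the model more deceptive where it counts. For this particular payoff table, subtle deception only stops paying once its catch probability p satisfies 2 − 3p < 1, i.e. p > 1/3, which is the crude "catch more than some fraction" intuition Rob gestures at, made concrete for one made-up reward scheme.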
Great post, titotal!