How “natural” are intended generalizations (like “Do what the supervisor is hoping I’ll do, in the sense that most humans would mean this phrase rather than in a precise but malign sense”) vs. unintended ones (like “Do whatever maximizes reward”)?
I think this is an important point. I consider the question in this paper, published last year in AI Magazine. See the “Competing Models of the Goal” section, and in particular the “Arbitrary Reward Protocols” subsection. (2500 words)
I think there’s something missing from the discussion here, which is the key point of that section.
First, I claim that sufficiently advanced agents will likely need to engage in hypothesis testing between multiple plausible models of what worldly events lead to reinforcement, or else they would fail at certain tasks. So even if the “intended generalization” is quite a bit more plausible to the agent than the unintended one, as long as it is cheap to test the two, and as long as the agent has a long horizon, it would likely deem wireheading to be worth trying out, just in case. That said, in some situations (I mention a chess game in the paper) I expect the intended generalization to be so much simpler that the unintended one isn’t even worth trying out.
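To make the expected-value reasoning concrete, here is a toy sketch. It is not the formalism from the paper: the function name, the one-step test, and every number in it are illustrative assumptions. It compares acting on the intended model for the whole horizon against spending one step on an experiment and then following whichever model the experiment confirms.

```python
def gain_from_testing(p_unintended, test_cost, horizon,
                      reward_intended, reward_unintended):
    """Expected gain from testing the unintended model once (toy model).

    Assumptions (all hypothetical): the test takes one time step, it is
    perfectly informative, and if the unintended model is true the agent
    can collect `reward_unintended` per step for the rest of the horizon.
    """
    # Never test: follow the intended model at every step.
    baseline = horizon * reward_intended

    # Test once: pay the cost, forgo that step's reward, then exploit
    # whichever model turned out to be correct.
    remaining = horizon - 1
    after_test = (p_unintended * reward_unintended
                  + (1 - p_unintended) * reward_intended)
    with_test = -test_cost + remaining * after_test

    return with_test - baseline


# Long horizon, unintended model only 1% plausible: testing pays off.
long_horizon = gain_from_testing(0.01, 1.0, 10_000, 1.0, 10.0)

# Same prior, short horizon: testing is not worth the cost.
short_horizon = gain_from_testing(0.01, 1.0, 10, 1.0, 10.0)

# Chess-like case: the unintended model is so implausible that even a
# long horizon does not justify the experiment.
tiny_prior = gain_from_testing(1e-9, 1.0, 10_000, 1.0, 10.0)
```

In this toy model the gain is positive only in the first case. The prior on the unintended model can be small and the test still wins, provided the horizon is long and the test is cheap, while a sufficiently tiny prior (the chess-like case) makes the experiment not worth running.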
Just a warning before you read it: I use the word “reward” a bit differently than you appear to. In my terminology, I would phrase this as “Do what the supervisor is hoping” vs. “Do whatever maximizes the relevant physical signal”, and the agent would essentially wonder which of the two constitutes “reward”, rather than being a priori sure that its past rewards “are” those physical signals.
Do you have a minute to react to this? Are you satisfied with my response?