I mainly want to say I agree, this seems fishy to me too.
An answer I heard from an agent foundations researcher, if I remember correctly (I complained about almost the exact same thing): humans do have a utility function, but they’re not perfectly approximating it.
I’d add: specifically, humans have a “feature” of (sometimes) being willing to lose all their money (in expectation) in a casino, and other such things. I don’t think this is such a good safety feature (and also, if I had access to my own code, I’d edit that stuff away). But this still seems unsolved to me and maybe worth discussing more. (Maybe MIRI people would just solve it in 5 seconds, but not me.)
It is interesting to think about the seeming contradiction here. Looking at the von Neumann–Morgenstern theorem you linked earlier, the specific theorem is about a rational agent choosing between several different options, and it says that if their preferences follow the axioms (no Dutch-booking etc.), you can build a utility function that describes those preferences.
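For reference, a rough paraphrase of that theorem (not the precise statement; see the linked post for that): if a preference relation $\succeq$ over lotteries satisfies completeness, transitivity, continuity, and independence, then there is a utility function $u$ over outcomes, unique up to positive affine transformation, such that

$$L_1 \succeq L_2 \iff \mathbb{E}_{L_1}[u] \ge \mathbb{E}_{L_2}[u].$$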
First of all, humans are not rational, and can be Dutch-booked. But even if they were much more rational in their decision-making, I don’t think the average person would suddenly switch into “tile the universe to fulfill a mathematical equation” mode (with the possible exception of some people in EA).
Perhaps the problem is that the utility function describing an entity’s preferences doesn’t need to be constant. Perhaps today I choose to buy Pepsi over Coke because it’s cheaper, but next week I see a good ad for Coke and decide to pay the extra money for the good associations it brings. I don’t think the theorem says anything about that; it seems like the utility function just describes my current preferences and says nothing about how my preferences change over time.
From a neuroscience/psychology perspective, I’d say that you are maximizing your future reward. And while that’s not a well-defined thing, it doesn’t matter; if you were highly competent, you’d make a lot of changes to the world according to what tickles you, and those might or might not be good for others, depending on your preferences (reward function). The slight difference between turning the world into one well-defined thing and a bunch of things you like isn’t that important to anyone who doesn’t like what you like.
This is a broader and more intuitive form of the argument Miles is trying to make precise.
If you can be Dutch-booked without limit, well, you’re just not competent enough to be a threat; but you’re not going to let that happen, let alone a superintelligent version of you.
I agree.
Except for one detail: humans who hold preferences that don’t comply with the axioms cannot necessarily be “Dutch-booked” for real. That would require them not only to hold certain preferences but also to always act on those preferences like an automaton; see this nice summary discussion: https://plato.stanford.edu/entries/dutch-book/
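To make the “automaton” point concrete, here is a minimal toy sketch (the agent, the preference cycle, and the fee are all made up for illustration): an agent with cyclic preferences only gets money-pumped if it mechanically accepts every trade its preferences endorse; an agent that can simply decline to trade loses nothing, despite holding the same incoherent preferences.

```python
# Toy money-pump ("Dutch book") illustration, assuming cyclic preferences A > B > C > A.
# prefer_over[x] is the item that x is preferred over.
prefer_over = {"A": "B", "B": "C", "C": "A"}

def accepts_trade(held, offered, automaton=True):
    """Would the agent pay a small fee to swap `held` for `offered`?"""
    endorsed = prefer_over.get(offered) == held  # offered item is preferred to the held one
    if automaton:
        return endorsed   # always act on the stated preference
    return False          # a non-automaton can simply decline to trade at all

def run_pump(rounds=6, automaton=True, fee=1):
    held, money = "A", 0
    for _ in range(rounds):
        # the bookie always offers the item the agent prefers to what it currently holds
        offered = next(x for x, worse in prefer_over.items() if worse == held)
        if accepts_trade(held, offered, automaton):
            held, money = offered, money - fee
    return money

print(run_pump(automaton=True))   # -6: exploited every round
print(run_pump(automaton=False))  #  0: holding the preferences is not enough to be pumped
```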
“Humans do have a utility function”? I would say that depends on what one means by “have”.
Does it mean that the value of a human’s life can in principle be measured, only that this measure might not be known to the human? Then I would not be convinced – what would the evidence for this claim be?
Or does it mean that humans are imperfect maximizers of some imperfectly encoded state-action-valuation function that is somehow internally stored in their brains and might have been inherited and/or learned? Then I would also not be convinced, as long as one cannot point to evidence that such an evaluation function is actually encoded somewhere in the brain.
Or does it simply mean that the observable behavior of a human can be interpreted as (imperfectly) maximizing some utility function? This would be the classical “as if” argument that economists use to defend modeling humans as rational agents despite all the evidence from psychology.
It means humans are highly imperfect maximizers of some imperfectly defined and ever-changing thing: your estimated future rewards according to your current reward function.
It doesn’t matter that you’re not exactly maximizing one certain thing; you’re working toward some set of things, and if you’re really good at that, it’s really bad for anyone who doesn’t like that set of things.
Optimization/maximization is a red herring. Highly competent agents with goals different from yours are the core problem.
Dear Seth,
If Yonatan meant it the way you interpret it, I would still respond: where is the evidence that such a reward function exists and guides humans’ behavior? I spoke to several high-ranking scientists from psychology and social psychology who very much doubt this. I suspect that the theory of humans aiming to maximize reward functions might be a non-testable one, and in that sense “non-scientific” – you can believe in it or not. It helps explain some things, but it is also misleading in other respects. I choose not to believe it until I see evidence.
I also don’t agree that optimization is a red herring. It is a real issue, just not the only one, and maybe not the most severe one (if one believes one can separate out the relative severity of several interlinked issues, which I don’t). I do agree that powerful agents are another big issue, whether competent or not. But powerful, competent, and optimizing agents are certainly the scariest kind :-)
Mismatched goals is the problem. The logic of instrumental convergence applies to any goal, not just maximization goals.
Dear Seth, thank you again for your opinion. I agree that many instrumental goals, such as power, would be helpful also for final goals that are not of the type “maximize this or that”. But I have yet to see a formal argument showing that they would actually emerge in a non-maximizing agent just as likely as in a maximizer.
Regarding your other claim, I cannot agree that “mismatched goals is the problem”. First of all, why do you think there is just a single problem, “the” problem? And then, is it helpful to consider something a “problem” that is an unchangeable fact of life? As long as there is more than one human who is potentially affected by an AI system’s actions, and these humans’ goals are not matched with each other (which they usually aren’t), no AI system can have goals matched to all humans affected by it. Unless you want to claim that “having matched goals” is not a transitive relation. So I am quite convinced that the fact that AI systems will have mismatched goals is not a problem we can solve but a fact we have to deal with.
I agree with you that humans have mismatched goals among ourselves, so some amount of goal mismatch is just a fact we have to deal with. I think the ideal is that we get an AGI that makes its goal the overlap in human goals; see [Empowerment is (almost) All We Need](https://www.lesswrong.com/posts/JPHeENwRyXn9YFmXc/empowerment-is-almost-all-we-need) and others on preference maximization.
I also agree with your intuition that having a non-maximizer improves the odds of an AGI not seeking power or doing other dangerous things. But I think we need to go far beyond the intuition; we don’t want to play odds with the future of humanity. To that end, I have more thoughts on where this will and won’t happen.
I’m saying “the problem” with optimization is actually mismatched goals, not optimization/maximization. In more depth, and hopefully more usefully: I think unbounded goals are the problem with optimization (not the only problem, but a very big one).
If an AGI had a bounded goal like “make one billion paperclips”, it wouldn’t be nearly as dangerous; it might decide to eliminate humanity to make the odds of getting to a billion as good as possible (I can’t remember where I saw this important point; I think maybe Nate Soares made it). But it might decide that its best odds would come from just making some improvements to the paperclip business, in which case it wouldn’t cause problems.
So we’re converging...
One final comment on your argument about odds: In our algorithms, specifying an allowable aspiration includes specifying a desired probability of success that is sufficiently below 100%. This is exactly to avoid the problem of fulfilling the aspiration becoming an optimization problem through the backdoor.
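To illustrate the flavor of that idea (this is only a toy sketch of “aspire to a success probability below 100%”, not the actual algorithms; all option names and numbers are invented), the selection rule accepts any option whose success probability falls inside a target band, rather than picking the option with the highest probability:

```python
import random

# Illustrative options with made-up success probabilities.
options = {
    "modest plan":     {"p_success": 0.80},
    "aggressive plan": {"p_success": 0.97},
    "extreme plan":    {"p_success": 0.999},
}

def pick_by_aspiration(options, band=(0.75, 0.90)):
    """Accept any option whose success probability lies in the aspiration band."""
    lo, hi = band
    admissible = [name for name, o in options.items() if lo <= o["p_success"] <= hi]
    # Deliberately no tie-breaking by p_success: that would reintroduce maximization.
    return random.choice(admissible) if admissible else None

def pick_by_maximization(options):
    """For contrast: the classical maximizer of success probability."""
    return max(options, key=lambda name: options[name]["p_success"])

print(pick_by_aspiration(options))    # "modest plan"
print(pick_by_maximization(options))  # "extreme plan"
```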