A paperclip maximiser and a pencil maximiser cannot “agree to disagree”. One of them will get to tile the universe with their chosen stationery implement, and one of them will be destroyed. They are mortal enemies of each other, and both of them are mortal enemies of the stapler maximiser, and the eraser maximiser, and so on. Even a different paperclip maximiser is the enemy, if their designs are different. The plastic paperclipper and the metal paperclipper must, sooner or later, battle to the death.
The inevitable result of a world with lots of different malevolent AGIs is a bare-knuckle, vicious battle royale to the death between all intelligent entities. In the end, only one goal can win.
Are you familiar with the concept of values handshakes? An AI programmed to maximize red paperclips and an AI programmed to maximize blue paperclips, each knowing that the other would prefer to destroy it, might instead agree on some intermediate goal based on their relative power and initial utility functions, e.g. they agree to maximize purple paperclips together, or to tile the universe with 70% red paperclips and 30% blue paperclips.
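[To make “based on their relative power and initial utility functions” concrete, here is a minimal toy sketch, my own illustration rather than anything from the comment above. It treats the handshake as a Nash bargaining problem where each AI’s fallback is its expected payoff from fighting; the linear utilities, the equating of win probability with relative power, and the war-cost parameter are all assumptions.]

```python
# Toy model only. Assumptions (mine, not the commenter's): each AI's utility is
# linear in the fraction of the universe tiled with its colour, the chance of
# winning an all-out war equals its relative power, and war destroys a fraction
# `war_cost` of whatever is being fought over.

def handshake_split(p_red: float, war_cost: float) -> float:
    """Fraction of the universe tiled red under a Nash bargaining 'handshake'.

    Each side's disagreement point is its expected payoff from fighting:
    red expects p_red * (1 - war_cost), blue expects (1 - p_red) * (1 - war_cost).
    Maximising (x - d_red) * ((1 - x) - d_blue) over the split x gives the
    closed form returned below.
    """
    d_red = p_red * (1.0 - war_cost)
    d_blue = (1.0 - p_red) * (1.0 - war_cost)
    return (1.0 + d_red - d_blue) / 2.0


print(handshake_split(p_red=0.7, war_cost=0.0))  # ~0.7: the 70/30 tiling above
print(handshake_split(p_red=0.7, war_cost=0.5))  # ~0.6: costly war favours compromise
```

[In this toy model, a costless war makes the bargained split simply reproduce relative power, while a more destructive war pushes the agreement towards an even split.]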
Related: Fearon 1995 from the IR literature. Basically, rational actors should only go to war against each other in a fairly limited set of scenarios.
+1 on this being a relevant intuition. I’m not sure how limited these scenarios are—aren’t information asymmetries and commitment problems really common?
Today, somewhat, but that’s just because human brains can’t prove the state of their beliefs or share specifications with each other (i.e., humans can lie about anything). There is no reason for artificial brains to have these limitations, and given any trend towards communal/social factors in intelligence, or towards self-reflection (which is required for recursive self-improvement), it becomes actively costly to be cognitively opaque.
I agree that they’re really common in the current world. I was originally thinking that these problems might become substantially less common in multipolar AGI scenarios (because future AIs may have better trust and commitment mechanisms than current humans do). Upon brief reflection, I think my original comment was overly concise and not very substantiated.
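[As one deliberately simplistic illustration of the kind of trust and commitment mechanism being gestured at above, here is a sketch of a hash commitment to a stated policy. The framing and the policy string are hypothetical; this only binds an agent to what it published, not to what its cognition actually does, which is exactly the transparency gap discussed in the parent comments.]

```python
# Sketch of one cheap commitment primitive: a hash commitment to a stated
# policy/specification. Hypothetical example; it binds an agent to what it
# published, not to what it will actually do.
import hashlib
import secrets

def commit(specification: bytes) -> tuple[bytes, bytes]:
    """Publish the digest now; keep the nonce secret until reveal time."""
    nonce = secrets.token_bytes(32)
    digest = hashlib.sha256(nonce + specification).digest()
    return digest, nonce

def verify(digest: bytes, nonce: bytes, specification: bytes) -> bool:
    """Counterparty checks the revealed specification matches the commitment."""
    return hashlib.sha256(nonce + specification).digest() == digest

policy = b"cooperate and keep the negotiated split; retaliate only if attacked"
digest, nonce = commit(policy)          # digest is shared up front
assert verify(digest, nonce, policy)    # checked when the policy is revealed
```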
For reference, here is a seemingly nice summary of Fearon’s “Rationalist explanations for war” by David Patel.
CLR just published a related sequence: https://www.lesswrong.com/posts/oNQGoySbpmnH632bG/when-does-technical-work-to-reduce-agi-conflict-make-a
This seems basic and wrong.
In the same way that two human superpowers can’t simply make a contract to guarantee world peace, two AI powers could not do so either.
(Assuming an AI safety worldview and the standard unaligned, agentic AIs) in the general case, each AI will always weigh/consider/scheme to get the other’s share of control, and will expect that the other is doing the same.
Re “based on their relative power and initial utility functions”: it’s possible that peace/agreement might come from some sort of “MAD”-style or game-theoretic situation (a toy example is sketched after this comment). But it doesn’t mean anything to say it will come from “relative power”.
Also, I would be cautious about being too specific about utility functions. I think an AI’s “utility function” generally isn’t a literal, concrete thing, like a Python function that returns comparisons, but might be far more abstract, and might only appear as emergent behavior. So it may not be something you can rely on to contract/compare/negotiate over.
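[Toy illustration of the “MAD” / game-theoretic point above, with made-up payoff numbers of my own: under credible retaliation, mutual restraint can be a Nash equilibrium even between agents that would each prefer total control.]

```python
# Toy deterrence game with made-up payoffs: peace can be a Nash equilibrium
# when credible retaliation makes aggression ruinous for the attacker too.
PAYOFFS = {
    # (red's move, blue's move): (red's payoff, blue's payoff)
    ("refrain", "refrain"): (0.7, 0.3),     # the bargained split holds
    ("attack",  "refrain"): (-5.0, -8.0),   # retaliation punishes the attacker
    ("refrain", "attack"):  (-8.0, -5.0),
    ("attack",  "attack"):  (-10.0, -10.0), # mutual destruction
}
MOVES = ("refrain", "attack")

def is_nash(red_move: str, blue_move: str) -> bool:
    """Neither player can gain by unilaterally switching its own move."""
    red_ok = all(PAYOFFS[(red_move, blue_move)][0] >= PAYOFFS[(alt, blue_move)][0]
                 for alt in MOVES)
    blue_ok = all(PAYOFFS[(red_move, blue_move)][1] >= PAYOFFS[(red_move, alt)][1]
                  for alt in MOVES)
    return red_ok and blue_ok

print(is_nash("refrain", "refrain"))  # True: with these payoffs, peace is stable
print(is_nash("attack", "attack"))    # False: each side prefers to back off
```

[This only shows that peace can be stable under particular payoffs; it says nothing about whether real agents would actually face such a payoff structure.]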
Nice point, Robi! That being said, it seems to me that having many values handshakes correlated with what humans want is not too different from historical generational changes within the human species.