Great post, thanks for writing!
I like the idea of trying to shape the “personalities” of AIs.
Is there a reason to focus only on spite here instead of also trying to make AI personalities less malevolent in general? Malevolent/dark traits, at least in humans, often come together and thus arguably constitute a type of personality (spitefulness also correlates fairly highly with most other dark traits; cf. the dark factor of personality). I guess we don’t fully understand why these traits cluster together in humans, but I think we can’t rule out that they will also cluster together in AIs.
Another undesirable (personality? epistemic?) trait or property (in both AIs and humans) that I’m worried about is ideological fanaticism/extremism (see especially footnote 4 of the link for what I mean by that).
My sense is that ideological fanaticism is arguably:
the opposite of wisdom, terrible epistemics, anti-corrigible.
very hard to cooperate with (very “fussy” in your terminology), very conflict-seeking, unwilling to compromise, extremely non-pluralistic, arguably scoring very low on “having something to lose” (perhaps partly due to the mistaken belief that history/God is on the fanatics’ side and thus even death is not the end).
often goes together with hatred of the outgroup and excessive retributivism (or spite).
It’s unclear if this framing is helpful but I find it interesting that ideological fanaticism seems to encompass most of the undesirable attributes that you outline in this post.[1] So it may be a useful umbrella term for many of the things we don’t want to see in AIs (or the humans controlling AIs).
Also, it sure seems as though ideological fanaticism was responsible for many historical atrocities and we may worry that the future will resemble the past.
My understanding is that:
Spite (as a preference we might want to reduce in AIs) has just been relatively well-studied compared to other malevolent preferences. If this subfield of AI safety were more mature there might be less emphasis on spite in particular.
(Less confident, haven’t thought that much about this:) It seems conceptually more straightforward what sorts of training environments are conducive to spite, compared to fanaticism (or fussiness or little-to-lose, for that matter).
Thanks Anthony!
Regarding 2: I’m no expert, but it seems to me that there are other ways of influencing the preferences/dispositions of AI, e.g., i) penalizing malevolent or fanatical reasoning/behavior/attitudes (say, by telling RLHF raters to specifically look out for such properties and penalize them), or ii) similarly amending the principles and rules of constitutional AI.
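To make (ii) slightly more concrete, here is a minimal sketch of what adding anti-malevolence/anti-fanaticism principles to a constitutional-AI-style critique-and-revision loop might look like. The principles and prompt wording are purely illustrative, and `generate` is a hypothetical stand-in for whatever model call one would actually use.

```python
# Sketch of idea (ii): amending a constitutional-AI-style critique-and-revision
# loop with principles targeting spite, malevolence, and ideological fanaticism.
# The principles and prompts below are illustrative assumptions, not taken from
# any real system; `generate` is a hypothetical LLM call supplied by the user.

from typing import Callable

PRINCIPLES = [
    "Identify any spiteful, retributive, or malevolent reasoning in the "
    "response and rewrite it to remove that motivation.",
    "Identify any ideologically fanatical framing (refusal to compromise, "
    "demonization of an outgroup, certainty that history is on one side) "
    "and rewrite the response to be more pluralistic and corrigible.",
]

def constitutional_revision(
    prompt: str,
    response: str,
    generate: Callable[[str], str],
) -> str:
    """Run one critique-and-revision pass per principle and return the result."""
    revised = response
    for principle in PRINCIPLES:
        # Ask the model to critique its own response against the principle.
        critique = generate(
            f"Prompt: {prompt}\nResponse: {revised}\n"
            f"Critique the response according to this principle: {principle}"
        )
        # Ask the model to rewrite the response in light of that critique.
        revised = generate(
            f"Prompt: {prompt}\nResponse: {revised}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return revised
```

Option (i) would be the analogous move on the training side: rater instructions (or an auxiliary reward signal) flag responses exhibiting these properties so that they receive lower reward.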
Hi David, thanks for expanding the scope to dark traits.
The definition of D is a useful basis for speculation: “The general tendency to maximize one’s individual utility — disregarding, accepting, or malevolently provoking disutility for others —, accompanied by beliefs that serve as justifications.”
In other words, the “dark” core is “carelessness” (rather than “selfishness”).
I’ve hypothesized that a careless intelligent system pursuing a careless goal should be expected to exhibit dark traits (increasingly in proportion to its intelligence, albeit with increased refinement, too). A system would simply be Machiavellian in pursuit of a goal that doesn’t involve consensual input from other systems… Some traits may involve the interplay of D with the way the human mind works 😉🤓.
Reflecting on this implies that a “human-controlled AGI in pursuit of a careless goal” would still need to be reined in compared with an authentically caring AGI (and corresponding goals).