Contrary to their claim that “it would have been very hard to predict that humans would like ice cream, sucralose, or sex with contraception,” I think it was predictable that these preferences would likely result from natural selection under constraints. In each of these examples, a mechanism that evolved to detect the achievement of an instrumentally important subgoal is triggered by a stimulus that i) is very similar to the stimuli an animal would experience when the subgoal is achieved, and ii) did not exist in the evolutionary environment. We should expect any (partially or fully) optimized bounded agent to have detectors for the achievement of instrumentally important subgoals. We should expect these detectors to analyze only a limited number of features with limited precision. And we should expect the few comparisons they do perform precisely to be optimized for distinctions that were important for success on the training data.
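To make the mechanism concrete, here is a minimal sketch (the feature names, thresholds, and values are invented purely for illustration): a detector that checks only one cheap proxy feature classifies the training distribution correctly, but is predictably triggered by a novel stimulus engineered to match that feature.

```python
# Toy illustration (hypothetical values): a bounded "calorie detector" that,
# like an evolved taste receptor, inspects only one cheap feature of a
# stimulus. On the "evolutionary" training distribution this feature
# correlates with the subgoal (caloric food); a novel stimulus that matches
# the feature (sucralose) triggers the detector despite having no calories.

def calorie_detector(stimulus: dict) -> bool:
    """Fires if the stimulus tastes sweet enough.

    Limited precision: it checks only 'sweetness', because in the
    training environment sweetness reliably indicated sugar.
    """
    return stimulus["sweetness"] > 0.5

# Training environment: sweetness and calories co-occur.
ripe_fruit = {"sweetness": 0.8, "calories": 60}
grass      = {"sweetness": 0.1, "calories": 5}

# Novel stimulus absent from the training environment: sweet but calorie-free.
sucralose  = {"sweetness": 0.9, "calories": 0}

assert calorie_detector(ripe_fruit)   # correct on training distribution
assert not calorie_detector(grass)    # correct on training distribution
assert calorie_detector(sucralose)    # predictable misfire off-distribution
```

Knowing which features the detector checks (and which it ignores) is exactly what makes the misfire predictable in advance.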
Given that these failures were predictable, it should be possible to systematically predict many analogous failures that might result from training AI systems on specific data sets or (simulated) environments. If we can predict such failures of generalization beyond the training data, then we might be able to prevent them, mitigate them, or regulate real-world applications so that AI systems won’t be applied to inputs where misclassification is likely and problematic. The latter approach is analogous to outlawing highly addictive drugs that mimic neurotransmitters signalling the achievement of instrumentally important subgoals.
Your framework seems to work for simple cases like “ice cream, sucralose, or sex with contraception”, but I don’t think it works for more complex cases like “peacocks would like giant colorful tails”.
There is also so much human behaviour that would have been essentially impossible to predict just from first principles and natural selection under constraints: poetry, chess playing, comedy, monasticism, sports, philosophy, effective altruism. These behaviours seem further removed from your detectors for instrumentally important subgoals, and/or to have a more complex relationship to those detectors, but they’re still widespread and important parts of human life. This seems to support the argument that the relationship between how a mind was evolved (e.g., by natural selection) and what it ends up wanting is unpredictable, possibly in dangerous ways.
Your model might still tell us that generalisation failures are very likely to occur, even if, as I am suggesting, it can’t predict many of the specific ways things will misgeneralise. But I’m not sure this offers much practical guidance when trying to develop safer AI systems. Then again, maybe I’m wrong about that?
I think the post The Selfish Machine by Maarten Boudry is relevant to this discussion.
Consider dogs. Canine evolution under human domestication satisfies Lewontin’s three criteria: variation, heritability, and differential reproduction. But most dogs are bred to be meek and friendly, the very opposite of selfishness. Breeders ruthlessly select against aggression, and any dog attacking a human usually faces severe fitness consequences: it is put down, or at least not allowed to procreate. In the evolution of dogs, humans call the shots, not nature. Some breeds, like pit bulls or Rottweilers, are of course selected for aggression (towards other animals, not towards their guardians), but that just goes to show that domesticated evolution depends on breeders’ desires.
How can we extend this difference between blind evolution and domestication to the domain of AI? In biology, the defining criterion of domestication is control over reproduction. If humans control an animal’s reproduction, deciding who gets to mate with whom, then it’s domesticated. If animals escape and regain their autonomy, they’re feral. By that criterion, house cats are only partly domesticated, as most moggies roam about unsupervised and choose their own mates, outside of human control. If you apply this framework to AIs, it should be clear that AI systems are still very much in a state of domestication. Selection pressures come from human designers, programmers, consumers, and regulators, not from blind forces. It is true that some AI systems self-improve without direct human supervision, but humans still decide which AIs are developed and released. GPT-4 isn’t autonomously spawning GPT-5 after competing in the wild with different LLMs; humans control its evolution.
By and large, current selective pressures for AI are the opposite of selfishness. We want friendly, cooperative AIs that don’t harm users or produce offensive content. If chatbots engage in dangerous behavior, like encouraging suicide or enticing journalists to leave their spouse, companies will frantically try to update their models and stamp out the unwanted behavior. In fact, some language models have become so safe, avoiding any sensitive topics or giving anodyne answers, that consumers now complain they are boring. And Google became a laughing stock when its image generator proved to be so politically correct as to produce ethnically diverse Vikings or founding fathers.
Interesting!
Thanks for the great point, Falk. I very much agree.