Given that these failures were predictable, it should be possible to systematically predict many analogous failures that might result from training AI systems on specific data sets or (simulated) environments.
Your framework seems to work for simple cases like "ice cream, sucralose, or sex with contraception", but I don't think it works for more complex cases like "peacocks would like giant colorful tails"?
There is also so much human behaviour that would have been essentially impossible to predict just from first principles and natural selection under constraints: poetry, chess playing, comedy, monasticism, sports, philosophy, effective altruism. These behaviours seem further removed from your detectors for instrumentally important subgoals, and/or to have a more complex relationship to those detectors, but they're still widespread and important parts of human life. This seems to support the argument that the relationship between how a mind was evolved (e.g., by natural selection) and what it ends up wanting is unpredictable, possibly in dangerous ways.
Your model might still tell us that generalisation failures are very likely to occur, even if, as I am suggesting, it can't predict many of the specific ways things will misgeneralise. But I'm not sure this offers much practical guidance when trying to develop safer AI systems. Maybe I'm wrong about that, though?
I think the post The Selfish Machine by Maarten Boudry is relevant to this discussion.
Consider dogs. Canine evolution under human domestication satisfies Lewontin's three criteria: variation, heritability, and differential reproduction. But most dogs are bred to be meek and friendly, the very opposite of selfishness. Breeders ruthlessly select against aggression, and any dog that attacks a human usually faces severe fitness consequences: it is put down, or at least not allowed to procreate. In the evolution of dogs, humans call the shots, not nature. Some breeds, like pit bulls or Rottweilers, are of course selected for aggression (toward other animals, not toward their guardians), but that just goes to show that domesticated evolution depends on breeders' desires.
How can we extend this difference between blind evolution and domestication to the domain of AI? In biology, the defining criterion of domestication is control over reproduction. If humans control an animal's reproduction, deciding who gets to mate with whom, then it's domesticated. If animals escape and regain their autonomy, they're feral. By that criterion, house cats are only partly domesticated, as most moggies roam about unsupervised and choose their own mates, outside of human control. If you apply this framework to AIs, it should be clear that AI systems are still very much in a state of domestication. Selection pressures come from human designers, programmers, consumers, and regulators, not from blind forces. It is true that some AI systems self-improve without direct human supervision, but humans still decide which AIs are developed and released. GPT-4 isn't autonomously spawning GPT-5 after competing in the wild with different LLMs; humans control its evolution.
By and large, current selective pressures for AI are the opposite of selfishness. We want friendly, cooperative AIs that don't harm users or produce offensive content. If chatbots engage in dangerous behavior, like encouraging suicide or enticing journalists to leave their spouse, companies will frantically try to update their models and stamp out the unwanted behavior. In fact, some language models have become so safe, avoiding any sensitive topics or giving anodyne answers, that consumers now complain they are boring. And Google became a laughing stock when its image generator proved to be so politically correct as to produce ethnically diverse Vikings or founding fathers.
Interesting!