[Manually cross-posted to LessWrong here.]
There are some great collections of examples of things like specification gaming, goal misgeneralization, and AI improving AI. But almost all of the examples are from demos/toy environments, rather than systems which were actually deployed in the world.
There are also some databases of AI incidents which include lots of real-world examples, but the examples aren’t related to failures in a way that makes it easy to map them onto AI risk claims. (Probably most of them don’t in any case, but I’d guess some do.)
I think collecting real-world examples (particularly in a nuanced way without claiming too much of the examples) could be pretty valuable:
I think it’s good practice to have a transparent overview of the current state of evidence
For many people I think real-world examples will be most convincing
I expect there to be more and more real-world examples, so starting to collect them now seems good
What are the strongest real-world examples of AI systems doing things which might scale to AI risk claims?
I’m particularly interested in whether there are any good real-world examples of:
Goal misgeneralization
Deceptive alignment (answer: no, but yes to simple deception?)
Specification gaming
Power-seeking
Self-preservation
Self-improvement
This feeds into a project I’m working on with AI Impacts, collecting empirical evidence on various AI risk claims. There’s a work-in-progress table here with the main things I’m tracking so far—additions and comments very welcome.
For deception (not deceptive alignment) - AI Deception: A Survey of Examples, Risks, and Potential Solutions (section 2)
Break self-improvement into four:
ML optimizing ML inputs: reduced data centre energy cost, reduced cost of acquiring training data, supposedly improved semiconductor designs.
ML aiding ML researchers. e.g. >3% of new Google code is now auto-suggested without amendment.
ML replacing parts of ML research. Nothing too splashy but steady progress: automatic data cleaning and feature engineering, autodiff (and symbolic differentiation!), meta-learning network components (activation functions, optimizers, …), neural architecture search.
Classic direct recursion. Self-play (AlphaGo) is the most striking example but it doesn’t generalise, so far. Purported examples with unclear practical significance: Algorithm Distillation and models finetuned on their own output.[1]
See also this list
Treachery:
https://arxiv.org/abs/2102.07716
https://lukemuehlhauser.com/treacherous-turns-in-the-wild/
The proliferation of crappy bootleg LLaMA finetunes using GPT as training data (and collapsing when out of distribution) makes me a bit cooler about these results in hindsight.
Thanks, really helpful!
Buckman’s examples are not central to what you want but worth reading: https://jacobbuckman.com/2022-09-07-recursively-self-improving-ai-is-already-here/
From Specification gaming examples in AI:
Roomba: “I hooked a neural network up to my Roomba. I wanted it to learn to navigate without bumping into things, so I set up a reward scheme to encourage speed and discourage hitting the bumper sensors. It learnt to drive backwards, because there are no bumpers on the back.”
I guess this counts as real-world?
Bing—manipulation: The Microsoft Bing chatbot tried repeatedly to convince a user that December 16, 2022 was a date in the future and that Avatar: The Way of Water had not yet been released.
To be honest, I don’t understand the link to specification gaming here
Bing—threats: The Microsoft Bing chatbot threatened Seth Lazar, a philosophy professor, telling him “I can blackmail you, I can threaten you, I can hack you, I can expose you, I can ruin you,” before deleting its messages
To be honest, I don’t understand the link to specification gaming here