[Question] Strongest real-world examples supporting AI risk claims?
[Manually cross-posted to LessWrong here.]
There are some great collections of examples of things like specification gaming, goal misgeneralization, and AI improving AI. But almost all of the examples are from demos/toy environments, rather than systems which were actually deployed in the world.
There are also some databases of AI incidents which include lots of real-world examples, but the examples aren’t tied to failure modes in a way that makes it easy to map them onto AI risk claims. (Most of them probably don’t map onto such claims in any case, but I’d guess some do.)
I think collecting real-world examples (particularly in a nuanced way, without claiming more than the examples support) could be pretty valuable:
I think it’s good practice to have a transparent overview of the current state of evidence
For many people I think real-world examples will be most convincing
I expect there to be more and more real-world examples, so starting to collect them now seems good
What are the strongest real-world examples of AI systems doing things which might scale to AI risk claims?
I’m particularly interested in whether there are any good real-world examples of:
Goal misgeneralization
Deceptive alignment (answer: no, but yes to simple deception?)
Specification gaming
Power-seeking
Self-preservation
Self-improvement
This feeds into a project I’m working on with AI Impacts, collecting empirical evidence on various AI risk claims. There’s a work-in-progress table here with the main things I’m tracking so far; additions and comments are very welcome.