Hey :)
Disclaimer: I am no AI alignment expert, so consider skipping this comment and reading the quality ones instead. But there are no other comments yet, so here goes:
If I understood correctly,
You want to train a model on a limited training dataset (like all models)
To work reliably on inputs that fall outside of that initial dataset
By iterating on the model, refining it every time it encounters new inputs that were outside of the previously available dataset (a rough sketch of the loop I have in mind is below)
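Concretely, the toy loop I picture is something like this (all the function names, the novelty check, and the numbers are my own inventions, just to make my reading concrete; the post doesn’t specify any of this):

```python
import numpy as np

# --- Stubs for the parts the post doesn't specify; every name here is my invention ---
def observe_world():
    # Whatever the deployed model sees next; sometimes far outside the training range.
    return np.random.rand(5) * np.random.uniform(1, 5)

def ask_humans(situation):
    # Stand-in for a human evaluation of a novel situation.
    return situation.sum()

def act_on(predicted_goodness):
    # Stand-in for the AI acting on its current estimate.
    pass

def looks_novel(X, x, threshold=3.0):
    # Crude novelty check: is x far from the training data in z-score terms?
    mean, std = X.mean(axis=0), X.std(axis=0) + 1e-9
    return bool(np.any(np.abs((x - mean) / std) > threshold))

def fit(X, y):
    # Least-squares linear model as a stand-in for whatever model is actually trained.
    w, *_ = np.linalg.lstsq(np.c_[X, np.ones(len(X))], y, rcond=None)
    return w

# Limited initial dataset: (situation, how-good-for-humans) pairs.
X = np.random.rand(100, 5)
y = X.sum(axis=1)
w = fit(X, y)

for _ in range(1000):
    x = observe_world()
    if looks_novel(X, x):
        # Input outside the previous dataset: get human feedback, then refine the model.
        X, y = np.vstack([X, x]), np.append(y, ask_humans(x))
        w = fit(X, y)
    else:
        act_on(np.r_[x, 1.0] @ w)
```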
It seems to me (not that I know anything!!) that the model might update in ways that are very bad for humans, even while being well “aligned” with the initial data and with every iteration, regardless of how those iterations are performed.
TL;DR: I think so because concept space is superexponential and because value is fragile.
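Just to put rough numbers on “superexponential” (my numbers, not the post’s): even a tiny world described by 20 binary features already has about a million distinct situations, and the number of possible good-for-humans/bad-for-humans labelings of those situations is 2 to the power of a million, so any finite training set pins down only a vanishing sliver of it.

```python
import math

n_features = 20
n_situations = 2 ** n_features                           # 1,048,576 distinct situations
labeling_digits = int(n_situations * math.log10(2)) + 1  # digits in 2 ** n_situations

print(f"situations: {n_situations:,}")
print(f"possible good/bad labelings: a number with {labeling_digits:,} digits")
```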
Imagine we are very stupid humans [0], and we give the AI some training data from an empty room containing a chess board, and we tell the AI which rooms-with-chess-boards are better for us. And the AI learns this well and everyone is happy (except for the previous chess world champion).
And then we run the AI and it goes outside the room and sees things very different from its training data.
Even if the AI notices the difference and alerts the humans,
The humans can’t review all that data
The humans don’t understand all the data (or concepts) that the AI is building
The humans probably think they already trained the AI on the morally important information, and that the AI is using a good process for extrapolating value (if I understood you correctly)
And then the AI proceeds to act on models far beyond what it was trained on. Regardless of how it extrapolates, that was an impossible task to begin with, and it probably destroys the world.
What am I missing?
[0]
Why did I use the toy empty-room-with-chess story?
Because part of the problem I am trying to point out is “imagine how a training dataset can go wrong”, but it will never seem to go wrong if, for every missing thing in the dataset that we can imagine, we automatically imagine that the dataset already contains that thing.
An AI that is aware that value is fragile will behave in a much more cautious way. This gives a different dynamic to the extrapolation process.
Thanks!
So ok, the AI knows that some human values are unknown to the AI.
What does the AI do about this?
The AI can do some action that maximizes the known human values, and risk hurting the values it doesn’t know about.
The AI can do nothing and wait until it knows more (wait how long? There could always be missing values). A toy sketch of this trade-off is below.
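The cautious middle ground I can imagine (my own toy framing, not something from the article) is to keep a set of candidate value functions that are still consistent with what the humans have said, and only act when even the worst case over that set is acceptable, otherwise defer back to the humans:

```python
# Toy sketch: act only if the action is acceptable under *every* candidate value
# function still consistent with the training data; otherwise defer to humans.
# All names and numbers here are my inventions, purely for illustration.

CANDIDATE_VALUE_FUNCTIONS = [
    lambda a: a["paperclips"] * 1.0,                       # the value the AI was trained on
    lambda a: 1.0 - a["disruption"] * 10.0,                # possibly-missing value A
    lambda a: a["paperclips"] * 0.5 - a["disruption"]**2,  # possibly-missing value B
]

ACTIONS = [
    {"name": "do_nothing",        "paperclips": 0, "disruption": 0},
    {"name": "small_safe_step",   "paperclips": 1, "disruption": 0},
    {"name": "big_risky_rewrite", "paperclips": 9, "disruption": 5},
]

def worst_case_value(action):
    # How badly could this action turn out if any candidate value is the real one?
    return min(v(action) for v in CANDIDATE_VALUE_FUNCTIONS)

def choose(actions, acceptable_floor=0.0):
    best = max(actions, key=worst_case_value)
    if worst_case_value(best) < acceptable_floor:
        return "ask_the_humans"      # nothing is safe under all candidate values
    return best["name"]

print(choose(ACTIONS))               # "small_safe_step": fine under every candidate value
```

Of course, this only helps if the set of candidate value functions actually covers the values that are missing, which is what my question below is about.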
Something I’m not sure I understood from the article:
Does the AI assume that it is able to list all the possible values that humans might care about? Is that how the AI is supposed to guard against any of those possible human values going down too much?