“But I’m not sure how the AI would come to understand ‘smart’ human goals without acquiring those goals”
The easiest way to see the flaw in this reasoning is to note that by inserting a negative sign in the objective function we can make the AI aim for the exact opposite of what it would otherwise aim for. In other words, having x in the training data doesn’t show that the AI will seek x rather than avoid x.
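To make the sign point concrete, here’s a minimal toy sketch (my own construction, not anything from an actual system): a logistic-regression stand-in for “the AI”, trained on data about which outcomes a human approves of, where the only difference between the two runs is one sign in the objective.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "situations": 2-D feature vectors, labelled 1 when the (hypothetical)
# human approves of the outcome and 0 otherwise.
X = rng.normal(size=(500, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

def train(sign, steps=500, lr=0.05):
    """Logistic regression trained by gradient descent on sign * cross-entropy.
    sign=+1 minimises the usual loss; sign=-1 minimises its negation,
    i.e. actively maximises the original loss."""
    w = np.zeros(2)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))   # predicted approval probability
        grad = X.T @ (p - y) / len(y)      # gradient of the cross-entropy loss
        w -= lr * sign * grad              # one sign decides seek vs. avoid
    return w

for sign in (+1, -1):
    w = train(sign)
    acc = ((X @ w > 0).astype(float) == y).mean()
    print(f"sign={sign:+d}  accuracy={acc:.2f}")
# sign=+1 -> accuracy near 1.0: the model picks out the approved outcomes.
# sign=-1 -> accuracy near 0.0: same data, same "understanding" of the
#            approval pattern, but it now reliably identifies the opposite.
```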
I originally also included a second argument, but I’m now confused about why I was saying this, as it doesn’t really seem analogous. It seems to show that occurring in the data != being represented in the network, when what I need to show is: represented in the network != optimised for by the objective function.
Second argument: the network can also ignore x. Imagine an AI given lots of colour data and trained to identify the shape of dark objects on a white background. If the objective function only rewards correct guesses and punishes incorrect ones, there’s no incentive for the network to learn to represent colour as opposed to darkness, assuming colour is uncorrelated with shape.
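To illustrate that incentive claim, a toy sketch (again my own construction, collapsing “shape of a dark object” down to a single darkness-derived feature): a classifier rewarded only for correct guesses ends up putting essentially no weight on the uncorrelated colour feature.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# Two hand-made features standing in for what the network could represent:
# a darkness-derived "shape" signal that determines the correct answer, and
# a colour hue that is statistically independent of it.
darkness = rng.uniform(0.0, 1.0, size=n)
hue = rng.uniform(0.0, 1.0, size=n)
y = (darkness > 0.5).astype(float)    # the correct guess depends on darkness only

X = np.column_stack([darkness, hue])
X -= X.mean(axis=0)                   # centre the features

# Train a logistic classifier that is only rewarded for correct guesses.
w = np.zeros(2)
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-X @ w))
    w -= 0.5 * X.T @ (p - y) / n

print(f"darkness weight: {w[0]:.2f}, hue weight: {w[1]:.2f}")
# The darkness weight becomes large; the hue weight stays small, because
# nothing in the objective ever rewards representing colour.
```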
The point about inserting a negative sign is good (though then there’s a question of who inserts that sign: homicidal maniacs? Is it put in by accident?).
Re the colour example, this seems disanalogous (unless I misunderstand), because if the AI is correctly identifying human-aligned actions and performing them, that means it understands our goals to the extent that matters. Maybe not all of them, and maybe it doesn’t understand all the abstruse reasons why we care about various things, but I don’t really care about that (I care about people not being killed or harmed).
There was a real example of this: some group accidentally carried out a large training run in which the AI was trained to be maximally offensive rather than minimally offensive.
Actually, rereading it, I don’t really know where I was going with the colour example. I think I probably messed up, as you said.
You could also imagine a situation in which a property is defined by a PCA component and hence isn’t robust to inversion, since PCA components are only unique up to sign (more generally, up to a scalar multiple).
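To spell out what I mean, a tiny numpy sketch (my own toy data, just to illustrate the sign ambiguity): the leading principal component v and its negation −v are equally valid outputs of the decomposition, so a “property” defined as the projection onto that component can come out perfectly inverted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated 2-D data with one clearly dominant direction of variation.
X = rng.normal(size=(1000, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
X -= X.mean(axis=0)

# Leading principal component from the covariance eigendecomposition.
cov = X.T @ X / len(X)
eigvals, eigvecs = np.linalg.eigh(cov)
v = eigvecs[:, -1]            # top component; -v is an equally valid answer

scores_pos = X @ v            # the "property": projection onto the component
scores_neg = X @ (-v)         # same decomposition, opposite sign convention

print(np.var(scores_pos), np.var(scores_neg))      # identical variance explained
print(np.corrcoef(scores_pos, scores_neg)[0, 1])   # -1.0: perfectly inverted
```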