The point about inserting a negative sign is good (though then there’s a question of who inserts that sign—homicidal maniacs? Is it put in by accident?)
Re the colour example, this seems disanalogous (unless I misunderstand) because if the AI is correctly identifying human-aligned actions and performing them, that means it understands our goals to the extent we care about. Like maybe not all of them, and maybe it doesn’t understand all the abstruse reasons why we care about various things, but I don’t really care about that (I care about people not being killed or harmed).
There was an example where some group accidentally performed a large run where they trained the AI to be maximally offensive rather than minimally offensive.
Actually, rereading I don’t really know where I was going with the color example. I think I probably messed up as you said.
You could also imagine a situation something like a property being defined by a PCA component, hence not being robust to inversion because PCA components are only unique up to multiplication by a scalar.
The point about inserting a negative sign is good (though then there’s a question of who inserts that sign—homicidal maniacs? Is it put in by accident?)
Re the colour example, this seems disanalogous (unless I misunderstand) because if the AI is correctly identifying human-aligned actions and performing them, that means it understands our goals to the extent we care about. Like maybe not all of them, and maybe it doesn’t understand all the abstruse reasons why we care about various things, but I don’t really care about that (I care about people not being killed or harmed).
There was an example where some group accidentally performed a large run where they trained the AI to be maximally offensive rather than minimally offensive.
Actually, rereading I don’t really know where I was going with the color example. I think I probably messed up as you said.
You could also imagine a situation something like a property being defined by a PCA component, hence not being robust to inversion because PCA components are only unique up to multiplication by a scalar.