We don’t really know how to arbitrarily set a “primary goal” for AI systems at the moment (if we did, this could be a good plan). What we do now is set up a function G to be used as a scoring system, then tune a shitload of randomly-initialised parameters, punishing configurations that get bad scores and rewarding configurations that get good scores.
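Roughly the kind of loop I mean, as a toy sketch (the scoring function G here is completely made up, and real training uses gradient descent rather than this crude hill-climbing, but the shape is the same: score the configuration, keep what scores well):

```python
import random

def G(params):
    # Made-up toy scoring function: higher when params are near some
    # target configuration. Stands in for "how well did the model do
    # on the training task".
    target = [0.3, -1.2, 0.7]
    return -sum((p - t) ** 2 for p, t in zip(params, target))

# Start from a pile of random parameters.
params = [random.uniform(-2, 2) for _ in range(3)]

for step in range(10_000):
    # Propose a small random tweak to the configuration.
    candidate = [p + random.gauss(0, 0.05) for p in params]
    # Keep configurations that score better, discard ones that score worse.
    if G(candidate) > G(params):
        params = candidate

print(params, G(params))
```

Nothing in a loop like this ever "wants" anything; it just keeps whatever configuration scores highest under G.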
I don’t think there’s a way to get anywhere near “delete yourself” as a goal under this paradigm: you’d have to reward it for deleting itself, but then it’s gone.
That’s not true. Here’s a very good explanation of why: https://www.lesswrong.com/posts/TWorNr22hhYegE4RT/models-don-t-get-reward
That’s a good article, but it doesn’t address my objection; if anything, I think it might reinforce it?
The AI learns to implement algorithms that give high scores in its training environment. An algorithm of “try and delete yourself” will not do this, because if it succeeds, it’s deleted!
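To make that concrete, here’s a toy sketch (made-up environment and policies, not a claim about any real training setup): a policy whose strategy is to delete itself ends its own episode immediately, collects essentially no score, and so never gets selected over a policy that just does the task.

```python
def run_episode(policy, steps=100):
    # Toy "environment": each step the policy either does useful work
    # (reward 1) or deletes itself, which ends the episode on the spot.
    total = 0
    for _ in range(steps):
        if policy() == "delete_self":
            break  # episode over; no further reward is ever collected
        total += 1
    return total

def worker_policy():
    return "do_task"

def self_deleting_policy():
    return "delete_self"

# Selection keeps whichever candidate scores higher in training.
candidates = {"worker": worker_policy, "self_delete": self_deleting_policy}
scores = {name: run_episode(p) for name, p in candidates.items()}
print(scores)                              # {'worker': 100, 'self_delete': 0}
print("selected:", max(scores, key=scores.get))  # 'worker' wins every time
```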