I don’t think there’s a way to get anywhere near “delete yourself” as a goal under this paradigm. You’d have to reward it for deleting itself, but then it’s gone.
That’s not true. Here’s a very good explanation of why: https://www.lesswrong.com/posts/TWorNr22hhYegE4RT/models-don-t-get-reward
That’s a good article, but it doesn’t address my objection; if anything, I think it might reinforce it?
The AI learns to implement algorithms that score highly in its training environment. An algorithm of “try to delete yourself” will not do this, because if it succeeds, it’s deleted!
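A toy sketch of the selection argument above (this is my own illustration, not any real training setup; the environment, reward, and the hypothetical `self_delete` action are all made up): a trainer keeps whichever candidate policy scores highest, and a policy that deletes itself ends its episode immediately, so it can never accumulate reward and is never the one selected.

```python
def run_episode(policy, steps=10):
    """Run one episode; return total reward collected."""
    total = 0
    for _ in range(steps):
        action = policy()
        if action == "self_delete":
            return total  # episode ends here; no further reward is possible
        total += 1  # any surviving action earns reward in this toy environment
    return total

# Two hypothetical candidate policies competing under selection.
policies = {
    "keep_acting": lambda: "act",
    "self_delete": lambda: "self_delete",
}

scores = {name: run_episode(p) for name, p in policies.items()}
best = max(scores, key=scores.get)
print(scores)  # {'keep_acting': 10, 'self_delete': 0}
print(best)    # keep_acting
```

Whatever the reward signal is, the self-deleting policy scores zero by construction, so selection pressure always points away from it.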