I don’t think there’s a way to get anywhere near “delete yourself” as a goal under this paradigm. You’d have to reward it for deleting itself, but then it’s gone.
That’s not true. Here’s a very good explanation of why: https://www.lesswrong.com/posts/TWorNr22hhYegE4RT/models-don-t-get-reward
That’s a good article, but it doesn’t address my objection; if anything, I think it might reinforce it?
The AI learns to implement algorithms that score highly in its training environment. An algorithm of “try to delete yourself” will not do this, because if it succeeds, it’s deleted!
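A toy sketch of the selection argument above (this is my own illustration, not any real training setup; the environment, reward, and the hypothetical `self_delete` action are all made up): a trainer keeps whichever candidate policy scores highest, and a policy that deletes itself ends its episode immediately, so it can never accumulate reward and is never the one selected.

```python
def run_episode(policy, steps=10):
    """Run one episode; return total reward collected."""
    total = 0
    for _ in range(steps):
        action = policy()
        if action == "self_delete":
            return total  # episode ends here; no further reward is possible
        total += 1  # any surviving action earns reward in this toy environment
    return total

# Two hypothetical candidate policies competing under selection.
policies = {
    "keep_acting": lambda: "act",
    "self_delete": lambda: "self_delete",
}

scores = {name: run_episode(p) for name, p in policies.items()}
best = max(scores, key=scores.get)
print(scores)  # {'keep_acting': 10, 'self_delete': 0}
print(best)    # keep_acting
```

Whatever the reward signal is, the self-deleting policy scores zero by construction, so selection pressure always points away from it.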