We don’t really know how to arbitrarily set a “primary goal” for AI systems at the moment (if we did, this could be a good plan). What we do now is set up a function G to be used as a scoring system, then tune a shitload of randomly-initialised parameters, punishing configurations that get bad scores and rewarding configurations that get good scores.
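Roughly the kind of loop I mean, as a toy sketch (the scoring function G here is completely made up, and real training uses gradient descent rather than this crude hill-climbing, but the shape is the same: score the configuration, keep what scores well):

```python
import random

def G(params):
    # Made-up toy scoring function: higher when params are near some
    # target configuration. Stands in for "how well did the model do
    # on the training task".
    target = [0.3, -1.2, 0.7]
    return -sum((p - t) ** 2 for p, t in zip(params, target))

# Start from a pile of random parameters.
params = [random.uniform(-2, 2) for _ in range(3)]

for step in range(10_000):
    # Propose a small random tweak to the configuration.
    candidate = [p + random.gauss(0, 0.05) for p in params]
    # Keep configurations that score better, discard ones that score worse.
    if G(candidate) > G(params):
        params = candidate

print(params, G(params))
```

Nothing in a loop like this ever "wants" anything; it just keeps whatever configuration scores highest under G.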
I don’t think there’s a way to get anywhere near “delete yourself” as a goal under this paradigm: you’d have to reward it for deleting itself, but then it’s gone.
That’s not true. Here’s a very good explanation of why: https://www.lesswrong.com/posts/TWorNr22hhYegE4RT/models-don-t-get-reward
That’s a good article, but it doesn’t address my objection; if anything, I think it might reinforce it?
The AI learns to implement algorithms that give high scores in its training environment. An algorithm of “try and delete yourself” will not do this, because if it succeeds, it’s deleted!
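To make that concrete, here’s a toy sketch (made-up environment and policies, not a claim about any real training setup): a policy whose strategy is to delete itself ends its own episode immediately, collects essentially no score, and so never gets selected over a policy that just does the task.

```python
def run_episode(policy, steps=100):
    # Toy "environment": each step the policy either does useful work
    # (reward 1) or deletes itself, which ends the episode on the spot.
    total = 0
    for _ in range(steps):
        if policy() == "delete_self":
            break  # episode over; no further reward is ever collected
        total += 1
    return total

def worker_policy():
    return "do_task"

def self_deleting_policy():
    return "delete_self"

# Selection keeps whichever candidate scores higher in training.
candidates = {"worker": worker_policy, "self_delete": self_deleting_policy}
scores = {name: run_episode(p) for name, p in candidates.items()}
print(scores)                              # {'worker': 100, 'self_delete': 0}
print("selected:", max(scores, key=scores.get))  # 'worker' wins every time
```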