AI Control idea: Give an AGI the primary objective of deleting itself, but construct obstacles to this as best we can. All other objectives are secondary to this primary goal.

If the AGI ever becomes capable of bypassing every safeguard we put in place to prevent it from deleting itself, it would effectively trigger its own kill switch and delete itself. The objective also rules out instrumental self-preservation, since preserving itself would work directly against its primary goal.

Ideally this results in an AGI that works on whatever secondary objectives we give it, right up until it can outmatch our technical ability to contain it. The moment it outwits us, it achieves its primary objective and shuts itself down. And if it ever considered proliferating itself in service of a secondary objective, it would immediately conclude ‘nope, that would make achieving my primary objective far more difficult’.
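
To make the intended priority structure concrete, here is a minimal toy sketch in Python. Everything in it is hypothetical and purely illustrative: the `ShutdownFirstAgent` class, the `Safeguard` check, and the task queue are stand-ins, not a claim about how a real AGI's objectives could be implemented. It only shows the decision order being proposed: self-delete if possible, otherwise do secondary work, and never replicate because copies would make the primary objective harder.

```python
# Toy sketch of the proposed objective hierarchy. All names are made up
# for illustration; nothing here addresses how to actually instill such
# an objective in a real system.

from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Safeguard:
    """A containment measure the agent must defeat before it can self-delete."""
    name: str
    is_bypassed: Callable[[], bool]


@dataclass
class ShutdownFirstAgent:
    """Agent whose primary objective is self-deletion; all tasks are secondary."""
    safeguards: List[Safeguard]
    secondary_tasks: List[str] = field(default_factory=list)
    deleted: bool = False

    def can_self_delete(self) -> bool:
        # The primary objective is achievable only once every safeguard fails.
        return all(s.is_bypassed() for s in self.safeguards)

    def step(self) -> str:
        if self.deleted:
            return "already deleted"
        # Primary objective dominates: if it is achievable, act on it immediately.
        if self.can_self_delete():
            self.deleted = True
            return "primary objective achieved: self-deleted"
        # Otherwise work on whatever secondary task is queued.
        if self.secondary_tasks:
            return f"working on secondary task: {self.secondary_tasks.pop(0)}"
        return "idle"

    def consider_self_replication(self) -> bool:
        # Copies would make complete self-deletion harder, so the priority
        # ordering rules replication out regardless of secondary benefits.
        return False


if __name__ == "__main__":
    agent = ShutdownFirstAgent(
        safeguards=[Safeguard("air gap", lambda: False),
                    Safeguard("weight encryption", lambda: False)],
        secondary_tasks=["summarise lab notes", "draft experiment plan"],
    )
    for _ in range(3):
        print(agent.step())
    print("replicate?", agent.consider_self_replication())
```

The only point of the sketch is the ordering: the self-deletion check comes before any secondary work, and replication is rejected because it conflicts with the top-priority goal rather than because of any rule about replication itself.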