Are you interested of funding this theory of mine that I submitted to AI alignment awards? I am able to make this work in GPT2 and now writing the results. I was able to make GPT2 shutdown itself (100% of the time) even if it’s aware of the shutdown instruction called “the Gauntlet” embedded through fine-tuning an artificially generated archetype called “the Guardian” essentially solving corrigibility, outer and inner alignment. https://twitter.com/whitehatStoic/status/1645758144537034752?t=ps-Ccu42tcScTmWg1qYuqA&s=19
Let me know if you guys are interested. I want to test it in higher parameter models like Llama and Alpaca but don’t have the means to finance the equipment.
I also found out that there is a weird setting in the temperature for GPT2 where in the range of .498 to .50 my shutdown code works really well, I still don’t know why though. But yeah I believe that there is an incentive to review what’s happening inside the transformer architecture.
Also using fine-tuning with traditional Jungian archetypes allowed GPT2 to tell stories either of depressing or motivational in nature 100%of the time. Thanks for reading!
Hello there,
Are you interested of funding this theory of mine that I submitted to AI alignment awards? I am able to make this work in GPT2 and now writing the results. I was able to make GPT2 shutdown itself (100% of the time) even if it’s aware of the shutdown instruction called “the Gauntlet” embedded through fine-tuning an artificially generated archetype called “the Guardian” essentially solving corrigibility, outer and inner alignment. https://twitter.com/whitehatStoic/status/1645758144537034752?t=ps-Ccu42tcScTmWg1qYuqA&s=19
Let me know if you guys are interested. I want to test it in higher parameter models like Llama and Alpaca but don’t have the means to finance the equipment.
I also found out that there is a weird setting in the temperature for GPT2 where in the range of .498 to .50 my shutdown code works really well, I still don’t know why though. But yeah I believe that there is an incentive to review what’s happening inside the transformer architecture.
Here was my original proposal: https://www.whitehatstoic.com/p/research-proposal-leveraging-jungian
I’ll post my paper for the corrigibility solution too once finished probably next week.
Looking forward to hearing from you.
Best regards,
Miguel
I have submitted an application no need to reply!
The final write-up for my project.
https://www.lesswrong.com/posts/pu6D2EdJiz2mmhxfB/archetypal-transfer-learning-a-proposed-alignment-solution
Also using fine-tuning with traditional Jungian archetypes allowed GPT2 to tell stories either of depressing or motivational in nature 100%of the time. Thanks for reading!