You can’t solve every possible jailbreak, but you should solve every jailbreak humanly possible if you’re going to release an AI that is claimed to be almost superhuman at cyber skills. I think current models are mostly bad for society, but I also think there’s a possibility that current models could achieve AGI. Maybe it’s only a 4% chance, but again, why take the risk? What is there to gain (other than money)?
I don’t understand how publicly releasing these models will help in researching AI safety (and when I say “AI safety” I mostly mean AGI alignment). I thought the whole point of an aligned AGI is that you don’t have to tell it to do stuff correctly: it already knows what’s correct, even better than you do. So I don’t see how letting anyone use the models will help in aligning them. I’m not an AI expert or anything, but to me it seems aligning AGI is less of a “we don’t have enough data” problem and more of a “we don’t even know where to start” problem.
Why? I would think an AI expert is much better suited to aligning a potential AGI than any common person. I just don’t see how the common person could contribute to alignment. If anything, I can see how they would contribute to DISalignment (engineering better jailbreaks, using the models for nefarious purposes, giving the models “bad values” (like “cause as much damage as possible”), etc.). I think I care about existential risk above all else, and I can’t imagine how publicly releasing “almost superhuman” models could decrease it.