You can’t solve every possible jailbreak, but you should solve every jailbreak humanly possible if you’re to release an AI that is claimed to be almost superhuman at cyber skills. I think current models are mostly bad for society, but I also think there’s a possibility that current models could achieve AGI. Maybe it’s only a 4% chance, but again, why take the risk? What is there to gain (other than money)?
I don’t understand how publicly releasing these models will help in researching AI safety (and when I say “AI safety” I mostly mean AGI alignment). I thought the whole point of an aligned AGI is that you don’t have to tell it to do stuff correctly, it already knows what’s correct, even more than you, so I don’t see how letting anyone use the models will help in aligning them. I’m not an AI expert or anything, but to me it seems aligning AGI is less of a “we don’t have enough data” problem and more of a “we don’t even know where to start” problem.
We don’t know how to align a possible AGI yet. The best we can hope for is that current models are close enough to whatever AGI is going to be, that trying to align them will teach us about aligning an AGI. This task, of trying to align them, is something that shouldn’t just be left to researchers in AI companies.
This task, of trying to align them, is something that shouldn’t just be left to researchers in AI companies
In principle I agree.
But would you say that people’s suitability to align AI safely (or more specifically ensuring that Fable does not write nasty software exploits) is defined less by their expertise and alignment with Anthropic’s stated mission and more by how much money they can spend on credits?
Because that’s what Anthropic and the impending IPO marketing are asking you to believe.
(tbh I’m not concerned by Fable manipulating its way into world domination. But if I was, I’d be extremely concerned that our most dedicated defenders against manipulative AI agents might be the sort of people who still take statements put out by AI companies at face value)
This task, of trying to align them, is something that shouldn’t just be left to researchers in AI companies.
Why? I would think an AI expert is much better suited to align a potential AGI than any common person. I just don’t see how the common person could contribute to alignment. If anything, I can see how they would contribute to MISalignment (engineering better jailbreaks, using the models for nefarious purposes, giving the models “bad values” like “cause as much damage as possible”, etc.). I think I care about reducing existential risk above all else, and I can’t imagine that publicly releasing “almost superhuman” models can decrease it.
But you’re not claiming that the models should only be shared with AI researchers. You’re claiming they should only be shared with AI researchers specifically employed by Anthropic.
Although no, I disagree that the input from non-AI-researchers is useless here—as you need to hear both from the end users and from people affected by AI and its decisions.
I’m thinking more of the “endgame” here, so I think the input from non-researchers is no more valuable than the input of the researchers (as in, any useful information you could obtain about AI safety can be obtained just from the researchers alone). To be specific, I believe something along the lines of AI 2027 is gonna be the somewhat-near future, so I wanna restrict access to advanced models as much as possible.
Think of it like nuclear bombs. If you had a technology that powerful, you wouldn’t want to risk any bad actors getting access to it, so you limit the number of owners as much as possible. It would be pretty ridiculous to want private companies to be able to own, or even use, nuclear weapons, and I think the case is pretty similar for current and future AI.