I had a question. Why do all the AI safety companies seem to do the opposite of AI safety? Anthropic keeps publicly releasing models (which means they can be accessed by billions of people), same for OpenAI, and while these models are unlikely to cause major problems, if you’re releasing a product that is going to be used by billions of people you should make sure the product is around 99.9999% failure proof. Anthropic themselves have said “AI models have reached a level of coding capability where they can surpass all but the most skilled humans at finding and exploiting software vulnerabilities” when referring to Mythos. Now sure, Fable is claimed to be “safe for general use”, and maybe it is, but why take the risk? Especially after only around 2-3 months of safety testing? I would want a company that claims to be for AI safety to always err on the side of caution, but this frankly seems quite reckless.
I still maintain that publicly releasing models is the correct way to get any chance of good alignment research. You can’t possibly believe that the researchers at Anthropic alone are enough to tackle the problem. It’s a global problem, and the global population should have the opportunity to help solve it.
I don’t know, maybe eventually it could help, but with these “cutting edge” coding models doesn’t it seem irresponsible? What if the safeguards don’t work? Shouldn’t you release the model publicly only after you’ve exhaustively patched every single possible jailbreak? (Even then I would argue it’s still better to not release it, since billions of people means hundreds of thousands of bad actors, and again, as an AI safety company with “cutting edge” models I wouldn’t take any risks.)
How can you “solve every possible jailbreak”? And is it worth crippling large-scale research into safeguards against future AI because of fears about what the current models might be capable of?
(My own answer is “maybe”. It depends on how bad you think current models are for society—pretty bad in my opinion—vs. how likely you think it is an existentially-threatening AI will actually be born out of the current efforts).
You can’t solve every possible jailbreak, but you should solve every jailbreak humanly possible if you’re going to release an AI that is claimed to be almost superhuman at cyber skills. I think current models are mostly bad for society, but I also think there’s a possibility that current models could achieve AGI. Maybe it’s only a 4% chance, but again, why take the risk? What is there to gain (other than money)?
I don’t understand how publicly releasing these models will help in researching AI safety (and when I say “AI safety” I mostly mean AGI alignment). I thought the whole point of an aligned AGI is that you don’t have to tell it to do stuff correctly, it already knows what’s correct, even more than you, so I don’t see how letting anyone use the models will help in aligning them. I’m not an AI expert or anything, but to me it seems aligning AGI is less of a “we don’t have enough data” problem and more of a “we don’t even know where to start” problem.
We don’t know how to align a possible AGI yet. The best we can hope for is that current models are close enough to whatever AGI is going to be, that trying to align them will teach us about aligning an AGI. This task, of trying to align them, is something that shouldn’t just be left to researchers in AI companies.
This task, of trying to align them, is something that shouldn’t just be left to researchers in AI companies
In principle I agree.
But would you say that people’s suitability to align AI safely (or more specifically ensuring that Fable does not write nasty software exploits) is defined less by their expertise and alignment with Anthropic’s stated mission and more by how much money they can spend on credits?
Because that’s what Anthropic and the impending IPO marketing are asking you to believe.
(tbh I’m not concerned about Fable manipulating its way into world domination. But if I were, I’d be extremely concerned that our most dedicated defenders against manipulative AI agents might be the sort of people who still take statements put out by AI companies at face value.)
This task, of trying to align them, is something that shouldn’t just be left to researchers in AI companies.
Why? I would think an AI expert is much better suited to align a potential AGI than any common person. I just don’t see how the common person could contribute to alignment. If anything, I can see how they would contribute to DISalignment (engineering better jailbreaks, using the models for nefarious purposes, giving the models “bad values” (like “cause as much damage as possible”), etc.). I think I care about reducing existential risk above all else, and I can’t imagine how publicly releasing “almost superhuman” models could decrease it.
But you’re not claiming that the models should only be shared with AI researchers. You’re claiming they should only be shared with AI researchers specifically employed by Anthropic.
Although no, I disagree that the input from non-AI-researchers is useless here—as you need to hear both from the end users and from people affected by AI and its decisions.
I’m thinking more of the “endgame” here, so I think the input from non-researchers is no more valuable than the input of the researchers (as in, any useful information you could obtain about AI safety can be obtained just from the researchers alone). To be specific, I believe something along the lines of AI 2027 is gonna be the somewhat-near future, so I wanna restrict access to advanced models as much as possible.
Think of it like nuclear bombs. If you had a technology that powerful, you wouldn’t want to risk any bad actors getting access to it, so you limit the number of owners as much as possible. It would be pretty ridiculous to want private companies to be able to own, or even use, nuclear weapons, and I think the case is pretty similar for current and future AI.
Anthropic and OpenAI have been quite successful at promoting the view that their models are very capable or even dangerous. By exaggerating the risks of these models they create a reputation as morally conscious, safety-oriented companies that nevertheless have the best models that everyone should be aware of. This gets the sympathies of investors and AI safety people alike.
But of course they must sell their models to get their money back.
So you’re saying it’s all BS? You’re saying that Anthropic and OpenAI ultimately don’t care about AI alignment? It sure seems like it to me, but browsing this website I have the feeling most people disagree with you.
There is a limit to how much a for-profit company can care about safety. It is of existential importance to make profit, so you have to sell your product. It’s just how it is.
OpenAI and Anthropic are technically public benefit companies that have objectives other than just profit, but to my understanding those objectives are formulated in a way that makes them not relevant currently, so profit is more important.
I’m not saying that there aren’t people in these companies who care about safety, or that safety is completely ignored when developing products, just that there is a limit to how far the companies will consider safety.
I understand that, but my point is that I thought these were AI safety companies, and would therefore prioritize AI safety above all else. If they don’t, why do so many people still treat them as if they did?
Who’s going to tell him?
Am I missing something obvious? Don’t Anthropic and OpenAI claim to be for AI safety research?
That’s marketing.