[Strong Disagree.] I think “anything that can go wrong will go wrong” becomes stronger and stronger, the bigger the intelligence gap there is between you and an AI you are trying to align. For it not to apply requires a mechanism for the AI to spontaneously become perfectly aligned. What is that mechanism?
Given that it’s much easier to argue that the risk of catastrophe is unacceptably high (say >10%), and this has the same practical implications, I’d suggest that you argue for that instead of 90%.
It does not have the same practical implications. As I say in the post, there is a big difference between the two in terms of the latter (90%) making it a “suicide race”. Arguing the former (10%), as many people have been doing for years, has not moved the needle (it seems as though OpenAI, Google DeepMind and Anthropic are basically fine with gambling tens to hundreds of millions of lives on a shot at utopia).[1]
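As a rough, hedged illustration of the scale being gambled (my own back-of-the-envelope numbers, not the figures from footnote [1]): with roughly 8 billion people alive, each percentage point of catastrophe risk corresponds to tens of millions of lives in expectation.

```python
# Back-of-the-envelope only: expected lives lost = P(catastrophe) x population.
# The probabilities below are illustrative assumptions, not figures from the post.
population = 8_000_000_000  # approximate current world population
for p in (0.01, 0.05, 0.10):
    expected_millions = p * population / 1e6
    print(f"P(catastrophe) = {p:.0%}: ~{expected_millions:,.0f} million lives in expectation")
```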
To be clear—I’m not saying it’s 90% for the sake of argument. It is what I actually believe.
Regarding whether they have the same practical implications, I guess I agree that if everyone had a 90% credence in catastrophe, that would be better than them having a 50% or 10% credence.
Inasmuch as you’re right that the major players have a 10% credence of catastrophe, we should either push to raise it or advocate for more caution, given the stakes.
My worry is that they don’t actually have that 10% credence, despite maybe saying they do, and that coming across as more extreme might stop them from listening.
I think you might be right that if we can make the case for 90%, we should make it. But I worry we can’t.
I think we should at least try! (As I am doing here.)
Ah I think I see the misunderstanding.
I thought you were invoking “Murphy’s Law” as a general principle that should always be relied upon, i.e. that in general a security mindset should be used.
But I think you’re saying that in the specific case of AGI misalignment, there is a particular reason to apply a security mindset, or to expect Murphy’s Law to hold.
Here are three things I think you might be trying to say:
1. As AI systems get more and more powerful, if there are any problems with your technical setup (training procedures, oversight procedures, etc.) and those AI systems are misaligned, they will be sure to exploit those vulnerabilities.
2. Any training setup that could lead to misaligned AI will lead to misaligned AI. That is, unless your technical setup for creating AI is watertight, it is highly likely that you end up with misaligned AI.
3. Unless the societal process you use to decide which technical setup gets used to create AGI is watertight, it is very likely to choose a technical setup that will lead to misaligned AI.
I would agree with 1 - that once you have created sufficiently powerful misaligned AI systems, catastrophe is highly likely.
But I don’t understand the reason to think that 2 and especially 3 are both true. That’s why I’m not confident in catastrophe: I think it’s plausible that we end up using a training method that creates aligned AI even though the way we chose that training method wasn’t watertight.
You seem to think that it’s very likely that we won’t end up with a good enough training setup, but I don’t understand why.
Looking forward to your reply!
Thanks. Yes, you’re right: I’m saying that you specifically need to apply a security mindset/Murphy’s Law when dealing with sophisticated threats that are more intelligent than you. You need to red team, to find holes in any solutions that you think might work. And a misaligned superhuman AI will find every hole/loophole/bug/attack vector you can think of, and more!
Yes, I’m saying 1. This is enough for doom by default!
2 and 3 are red herrings imo, as it looks like you are assuming that the AI is in some way neutral (neither aligned nor misaligned), and then either becomes aligned or misaligned during training. Where is this assumption coming from? The AI is always misaligned to start! The target of alignment is a tiny pocket in a vast multidimensional space of misalignment. It’s not anything close to a 50/50 thing. Yudkowsky uses the example of a lottery: the fact that you can either win or lose (2 outcomes) does not mean that the chance of winning is 50% (1 in 2)!
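To make the “tiny pocket” intuition concrete, here is a minimal illustrative sketch (my own, not from the original comments; the 10%-per-dimension target width is an arbitrary assumption): the fraction of a high-dimensional space occupied by a small target region shrinks exponentially with the number of dimensions, so having two possible outcomes says nothing about them being equally likely.

```python
# Illustrative sketch only: how much of a d-dimensional unit cube is covered
# by a target region spanning 10% of each axis (10% is an arbitrary assumption).
for d in [1, 2, 10, 100]:
    target_fraction = 0.1 ** d
    print(f"d = {d:>3}: fraction of space inside the target = {target_fraction:.1e}")
```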
Ah thanks Greg! That’s very helpful.
I certainly agree that the target is relatively small, in the space of all possible goals to instantiate.
But luckily we aren’t picking at random: we’re deliberately trying to aim for that target, which makes me much more optimistic about hitting it.
And another reason I see for optimism: yes, in some sense the AI is neutral (neither aligned nor misaligned) at the start of training. Actually, I would agree that it’s misaligned at the start of training, but what’s missing initially are the capabilities that make that misalignment dangerous. Put another way, it’s acceptable for early systems to be misaligned, because they can’t cause an existential catastrophe. It’s only by the time a system could take power if it tried that it’s essential it doesn’t want to.
These two reasons make me much less sure that catastrophe is likely. It’s still a very live possibility, but these reasons for optimism make me feel more like “unsure” than “confident of catastrophe”.
We might be deliberately aiming, but we have to get it right on the first try (with transformative AGI)! And so far none of our techniques are leading to anything close to perfect alignment, even for relatively weak systems (see the reference to “29%” in the OP).
Actually, I would agree that it’s misaligned at the start of training, but what’s missing initially are the capabilities that make that misalignment dangerous.

Right. And that’s where the whole problem lies! If we can’t meaningfully align today’s weak AI systems, what hope do we have for aligning much more powerful ones!? It’s not acceptable for early systems to be misaligned, precisely because of what that implies for the alignment of more powerful systems and our collective existential security. If OpenAI want to say “it’s OK that GPT-4 is nowhere close to being perfectly aligned, because we definitely, definitely will do better for GPT-5”, are you really going to trust them? They really tried to make GPT-4 as aligned as possible (for 6 months). And failed. And still released it anyway.