Ah I think I see the misunderstanding.
I thought you were invoking “Murphy’s Law” as a general principle that should be relied upon—I thought you were saying that, in general, a security mindset should be used.
But I think you’re saying that in the specific case of AGI misalignment, there is a particular reason to apply a security mindset, or to expect Murphy’s Law to hold.
Here are three things I think you might be trying to say:
1. As AI systems get more and more powerful, if there are any problems with your technical setup (training procedures, oversight procedures, etc.), then if those AI systems are misaligned, they will be sure to exploit those vulnerabilities.
2. Any training setup that could lead to misaligned AI will lead to misaligned AI. That is, unless your technical setup for creating AI is watertight, then it is highly likely that you end up with misaligned AI.
3. Unless the societal process you use to decide what technical setup gets used to create AGI is watertight, then it’s very likely to choose a technical setup that will lead to misaligned AI.
I would agree with 1 - that once you have created sufficiently powerful misaligned AI systems, catastrophe is highly likely.
But I don’t understand the reason to think that 2 and especially 3 are both true. That’s why I’m not confident in catastrophe: I think it’s plausible that we end up using a training method that ends up creating aligned AI even though the way we chose that training method wasn’t watertight.
You seem to think that it’s very likely that we won’t end up with a good enough training setup, but I don’t understand why.
Looking forward to your reply!
Thanks. Yes, you’re right: I’m saying that you specifically need to apply a security mindset/Murphy’s Law when dealing with sophisticated threats that are more intelligent than you. You need to red-team, to find holes in any solutions that you think might work. And a misaligned superhuman AI will find every hole/loophole/bug/attack vector you can think of, and more!
Yes, I’m saying 1. This is enough for doom by default!
2 and 3 are red herrings imo, as it looks like you are assuming that the AI starts out in some way neutral (neither aligned nor misaligned), and then becomes either aligned or misaligned during training. Where is this assumption coming from? The AI is always misaligned to start! The target of alignment is a tiny pocket in a vast multidimensional space of misalignment. It’s not anything close to a 50/50 thing. Yudkowsky uses the example of a lottery: the fact that you can either win or lose (2 outcomes) does not mean that the chance of winning is 50% (1 in 2)!
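The "tiny pocket in a vast space" point can be made concrete with a toy simulation (the dimensions and target width below are made-up illustrative numbers, not a claim about real goal spaces): even a target covering half of each axis becomes vanishingly small once the axes multiply.

```python
import random

# Toy illustration: a "target" region in a 10-dimensional unit cube.
# Even though each coordinate has a 50% chance of landing in the target
# interval, the joint hit rate is 0.5**10, roughly 0.1% -- two outcomes
# (hit or miss) do not mean a 50% chance of hitting.
DIMS = 10           # dimensions of the goal space (hypothetical)
TARGET_WIDTH = 0.5  # the target spans half of each axis (hypothetical)

def random_goal():
    """Sample a goal uniformly at random from the unit cube."""
    return [random.random() for _ in range(DIMS)]

def in_target(goal):
    """Hit only if every coordinate falls inside the target interval."""
    return all(x < TARGET_WIDTH for x in goal)

trials = 100_000
hits = sum(in_target(random_goal()) for _ in range(trials))
print(f"hit rate: {hits / trials:.4%}")  # ~0.5**10, i.e. about 0.1%
```

This only models picking at random, which is the weakest version of the claim; the disagreement below is about how much deliberate aiming shifts those odds.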
Ah thanks Greg! That’s very helpful.
I certainly agree that the target is relatively small, in the space of all possible goals to instantiate.
But luckily we aren’t picking at random: we’re deliberately trying to aim for that target, which makes me much more optimistic about hitting it.
And my other reason for optimism comes from the fact that yes, in some sense I do see the AI as neutral (neither aligned nor misaligned) at the start of training. Or rather, I would agree that it’s misaligned at the start of training, but what’s missing initially are the capabilities that make that misalignment dangerous. Put another way, it’s acceptable for early systems to be misaligned, because they can’t cause an existential catastrophe. It’s only by the time a system could take power if it tried that it’s essential it doesn’t want to.
These two reasons make me much less sure that catastrophe is likely. It’s still a very live possibility, but these reasons for optimism make me feel more like “unsure” than “confident of catastrophe”.
We might be deliberately aiming, but we have to get it right on the first try (with transformative AGI)! And so far none of our techniques are leading to anything close to perfect alignment even for relatively weak systems (see ref. to “29%” in OP!)
Actually, I would agree that it’s misaligned at the start of training, but what’s missing initially are the capabilities that make that misalignment dangerous.

Right. And that’s where the whole problem lies! If we can’t meaningfully align today’s weak AI systems, what hope do we have of aligning much more powerful ones!? It’s not acceptable for early systems to be misaligned, precisely because of what that implies for the alignment of more powerful systems and our collective existential security. If OpenAI want to say “it’s ok that GPT-4 is nowhere close to being perfectly aligned, because we definitely, definitely will do better for GPT-5”, are you really going to trust them? They really tried to make GPT-4 as aligned as possible (for 6 months). And failed. And still released it anyway.