Thanks. Yes, you’re right: I’m saying that you specifically need to apply a security mindset/Murphy’s Law when dealing with sophisticated threats that are more intelligent than you. You need to red team, to find holes in any solution you think might work. And a misaligned superhuman AI will find every hole/loophole/bug/attack vector you can think of, and more!
Yes, I’m saying 1. This is enough for doom by default!
2 and 3 are red herrings imo, as it looks like you are assuming that the AI is in some way neutral (neither aligned nor misaligned), and then either becomes aligned or misaligned during training. Where is this assumption coming from? The AI is always misaligned to start! The target of alignment is a tiny pocket in a vast multidimensional space of misalignment. It’s not anything close to a 50/50 thing. Yudkowsky uses the example of a lottery: the fact that you can either win or lose (2 outcomes) does not mean that the chance of winning is 50% (1 in 2)!
Ah thanks Greg! That’s very helpful.
I certainly agree that the target is relatively small, in the space of all possible goals to instantiate.
But luckily we aren’t picking at random: we’re deliberately trying to aim for that target, which makes me much more optimistic about hitting it.
Another reason I see for optimism: yes, in some sense I do see the AI as neutral (neither aligned nor misaligned) at the start of training. Actually, I would agree that it’s misaligned at the start of training, but what’s missing initially are the capabilities that make that misalignment dangerous. Put another way, it’s acceptable for early systems to be misaligned, because they can’t cause an existential catastrophe. It’s only by the time a system could take power if it tried that it’s essential it doesn’t want to.
These two reasons make me much less sure that catastrophe is likely. It’s still a very live possibility, but these reasons for optimism make me feel more like “unsure” than “confident of catastrophe”.
We might be deliberately aiming, but we have to get it right on the first try (with transformative AGI)! And so far none of our techniques are leading to anything close to perfect alignment even for relatively weak systems (see ref. to “29%” in OP!)
“Actually, I would agree that it’s misaligned at the start of training, but what’s missing initially are the capabilities that make that misalignment dangerous.”

Right. And that’s where the whole problem lies! If we can’t meaningfully align today’s weak AI systems, what hope do we have for aligning much more powerful ones!? It’s not acceptable for early systems to be misaligned, precisely because of what that implies for the alignment of more powerful systems and our collective existential security. If OpenAI wants to say “it’s ok that GPT-4 is nowhere close to being perfectly aligned, because we definitely definitely will do better for GPT-5”, are you really going to trust them? They really tried to make GPT-4 as aligned as possible (for 6 months). And failed. And still released it anyway.