I certainly agree that the target is relatively small, in the space of all possible goals to instantiate.
But luckily we aren’t picking at random: we’re deliberately trying to aim for that target, which makes me much more optimistic about hitting it.
And another reason I see for optimism: yes, in some sense the AI is neutral (neither aligned nor misaligned) at the start of training. Actually, I would agree that it’s misaligned at the start of training, but what’s missing initially are the capabilities that make that misalignment dangerous. Put another way, it’s acceptable for early systems to be misaligned, because they can’t cause an existential catastrophe. It’s only by the time a system could take power if it tried that it’s essential it doesn’t want to.
These two reasons make me much less sure that catastrophe is likely. It’s still a very live possibility, but they move me from “confident of catastrophe” to “unsure”.
We might be deliberately aiming, but we have to get it right on the first try (with transformative AGI)! And so far none of our techniques are leading to anything close to perfect alignment even for relatively weak systems (see ref. to “29%” in OP!)
Actually, I would agree that it’s misaligned at the start of training, but what’s missing initially are the capabilities that make that misalignment dangerous.
Right. And that’s where the whole problem lies! If we can’t meaningfully align today’s weak AI systems, what hope do we have for aligning much more powerful ones!? It’s not acceptable for early systems to be misaligned, precisely because of what that implies for the alignment of more powerful systems and our collective existential security. If OpenAI want to say “it’s ok GPT-4 is nowhere close to being perfectly aligned, because we definitely definitely will do better for GPT-5”, are you really going to trust them? They really tried to make GPT-4 as aligned as possible (for 6 months). And failed. And still released it anyway.
Ah thanks Greg! That’s very helpful.