Just a quick reply (I might reply in more depth later, but this is possibly the most important point):
> I agree that the fact we are aligning AI should make one more optimistic. Could you define what you mean by “unaligned AI”? It seems quite plausible that I will agree with your position, and think it amounts to something like “we were pretty successful with alignment”.
In my post I talked about the “default” alternative to doing lots of alignment research. Do you think that if AI alignment researchers quit tomorrow, engineers would stop applying RLHF and similar techniques to their models? That they wouldn’t train their AIs to exhibit human-like behaviors, or to be human-compatible?
It’s possible my language was misleading, painting a picture of unaligned AI that isn’t actually a realistic “default” in any scenario. But when I talk about unaligned AI, I’m simply talking about AI that doesn’t share the preferences of humans (whether its creator or its users). Crucially, humans are routinely misaligned in this sense. For example, employees don’t share the exact preferences of their employer (otherwise they’d have no need for a significant wage), yet employees are still typically docile, human-compatible, and assimilated into the overall culture.
This is largely the picture we should have in mind when we think about the “default” unaligned alternative, rather than imagining that humans will create something far more alien, far less docile, and therefore something with far less economic value.
(As an aside, I didn’t think this distinction was worth making because I assumed most readers had already strongly internalized the idea that RLHF isn’t “real alignment work”. I suspect I was mistaken, and probably confused a ton of people.)