Thanks for the great summary! A few questions about it:
1. You call mesa-optimization “the best current case for AI risk”. As Ben noted at the time of the interview, this argument hasn’t yet really been fleshed out in detail. And as Rohin subsequently wrote in his opinion of the mesa-optimization paper, “it is not yet clear whether mesa optimizers will actually arise in practice”. Do you have thoughts on what exactly the “Argument for AI Risk from Mesa-Optimization” is, and/or a pointer to the places where, in your opinion, that argument has been made (aside from the original paper)?
2. I don’t entirely understand the remark about the reference class of ‘new intelligent species’. What species are in that reference class? Many species that we regard as quite intelligent (orangutans, octopuses, New Caledonian crows) aren’t risky. Presumably you mean a reference class like “new species as smart as humans” or “new ‘generally intelligent’ species”. But then the reference class is very small, and it’s hard to know how strong that prior should be. In any case, how were you thinking of this reference class argument?
3. ‘The Boss Baby’, starring Alec Baldwin, is available for rental on Amazon Prime Video for $3.99. I suppose this is more of a comment than a question.
1. Oh man, I wish. :( I do think there are some people working on making a crisper case, and hopefully as machine learning systems get more powerful we might even see early demonstrations. The crispest statement of it I can make is: “Just as humans now optimize for goals other than the genetic fitness that evolution selected us for, other systems that contain optimizers may start optimizing for goals other than the ones specified by the outer optimizer.”
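As a toy illustration of that statement (my own construction, not from the mesa-optimization paper): an outer search can select an inner policy whose proxy objective is indistinguishable from the base objective on the training distribution, and only diverges at deployment. All names and numbers here are made up for the sketch.

```python
# Toy sketch of inner/outer objective divergence (hypothetical example).
# Base objective: end as close as possible to position 10.
def base_score(pos):
    return -abs(pos - 10)

# Two candidate inner policies the outer search chooses between.
def aligned(pos):
    return min(pos + 1, 10)  # step toward 10 and stop there

def go_right(pos):
    return pos + 1           # proxy goal: just keep moving right

def run(policy, start, steps):
    pos = start
    for _ in range(steps):
        pos = policy(pos)
    return pos

# "Training": short horizons, starting well left of the target.
# Neither policy ever reaches 10, so the outer search can't tell them apart.
candidates = {"aligned": aligned, "go_right": go_right}
train_scores = {
    name: sum(base_score(run(f, start, steps=3)) for start in range(6))
    for name, f in candidates.items()
}
print(train_scores)  # identical scores for both policies

# "Deployment": longer horizon, starting near the target.
print(run(aligned, 9, steps=20), base_score(run(aligned, 9, steps=20)))    # 10, 0
print(run(go_right, 9, steps=20), base_score(run(go_right, 9, steps=20)))  # 29, -19
```

Here `go_right` plays the role of the proxy objective: it gets selected because it performs identically during training, then pursues its own goal off-distribution. This is only an analogy for the selection argument, not a claim about how real mesa-optimizers would arise.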
Another related concept that I’ve seen (but haven’t followed up on) is what johnswentworth calls “Demons in Imperfect Search”, which argues for the possibility of runaway inner processes in a variety of imperfect search processes (not just ones that contain inner optimizers). This arguably happened with metabolic reactions early in the development of life, with greedy genes, and with managers in companies. Basically, I’m convinced that we don’t know enough about how powerful search processes work to be sure that they’ll end up somewhere we want.
I should also say that I think these kinds of arguments feel like the best current cases for AI alignment risk. Even if AI systems end up perfectly aligned with human goals, I’m still quite worried about what the balance of power looks like in a world with lots of extremely powerful AIs running around.
2. Yeah, here I should have said ‘new species more intelligent than us’. I think I was thinking of two things here:
Humans causing the extinction of less intelligent species
Some folk intuition around intelligent aliens plausibly causing human extinction (I admit this isn’t the best example...).
Mostly I meant here that since we don’t yet have examples of existentially risky technology, putting AI in the reference class of ‘new technology’ might make you think it’s extremely implausible that it would be existentially bad. But we do have examples of species causing the extinction of less intelligent species (and scarier intuitions in the vicinity), so to the extent that AI is a new, more intelligent species, we should think there’s at least some chance that it could be existentially bad.
3. Obviously not the same thing, but ‘The Boss Baby: Back in Business’, a spin-off of the original, not starring Alec Baldwin, is available on Netflix right now. I’ve watched about 20 seconds of it and feel comfortable saying that the money would be better spent on AI safety and governance work.