Planned summary of the podcast episode for the Alignment Newsletter:
In this podcast, Ben Garfinkel goes through several reasons why he is skeptical of classic AI risk arguments (some previously discussed <@here@>(@How Sure are we about this AI Stuff?@)). The podcast has considerably more detail and nuance than this summary.
Ben thinks that historically it has been hard to affect transformative technologies in ways that were foreseeably good for the long term; it’s hard to see, for example, what anyone could have done around the development of agriculture or industrialization that would have had a predictable positive impact on the world today. He thinks some potential avenues for long-term influence are addressing increased political instability or the possibility of lock-in, though it’s unclear what we could do today to influence the outcome of a lock-in, especially if it’s far away.
In terms of alignment, Ben focuses on the standard set of arguments outlined in Nick Bostrom’s Superintelligence, because they are broadly influential and relatively fleshed out. Ben has several objections to these arguments:
- He thinks a sudden jump to extremely powerful and dangerous AI systems is unlikely, and that if capabilities instead grow gradually, we have a much better chance of correcting problems as they come up.
- He thinks that making AI systems capable and giving AI systems the right goals are likely to go together.
- He thinks that just because there are many ways to create a system that behaves destructively, it doesn’t follow that the engineering process is likely to be attracted to those destructive systems; it seems unlikely that we would accidentally create systems destructive enough to end humanity.
Ben also spends a little time discussing <@mesa-optimization@>(@Risks from Learned Optimization in Advanced Machine Learning Systems@), a much newer argument for AI risk. He largely thinks that the case for risk from mesa-optimization hasn’t yet been fleshed out sufficiently. He also thinks it’s plausible that learning incorrect goals is a result of systems being insufficiently sophisticated to represent the intended goals appropriately; with sufficient training, we may in fact converge to the system we want.
Given the current state of argumentation, Ben thinks it’s worth EA time to flesh out newer arguments around AI risk, but also that EAs who don’t have a comparative advantage in AI-related topics shouldn’t necessarily switch into AI. Ben thinks it’s a moral outrage that we have spent less money on AI safety and governance than was spent making the 2017 movie ‘The Boss Baby’, starring Alec Baldwin.
Planned opinion:
This podcast covers a really impressive breadth of the existing argumentation. A lot of the reasoning is similar to <@reasoning I’ve heard from other researchers@>(@Takeaways from safety by default interviews@). I’m really glad that Ben and others are spending time critiquing these arguments; in addition to showing us where we’re wrong, it helps us steer towards more plausible risky scenarios.
I largely agree with Ben’s criticisms of the Bostrom model of AI risk; I think mesa-optimization is the best current case for AI risk and am excited to see more work on it. The parts of the podcast where I most disagreed with Ben were:
- Even in the absence of solid argumentation, I feel good about a prior where AI has a non-trivial chance of being existentially threatening, partially because I think it’s reasonable to put AI in the reference class of ‘new intelligent species’ in addition to ‘new technology’.
Rohin’s opinion:
I recommend listening to the full podcast, as it contains a lot of detail that wouldn’t fit in this summary. Overall I agree pretty strongly with Ben. I do think that some of the counterarguments come from a different frame than the classic arguments. For example, many of the counterarguments generalize from current ML practice to make claims about future AI systems, whereas I usually imagine that the classic arguments basically ignore current ML and instead claim that if an AI system is superintelligent, then it must be goal-directed and have convergent instrumental subgoals. If current ML systems don’t lead to goal-directed behavior, I expect proponents of the classic arguments would say that they also won’t lead to superintelligent AI systems. I’m not particularly sold on this intuition either, but I can see its appeal.
Thanks for the great summary! A few questions about it:
1. You call mesa-optimization “the best current case for AI risk”. As Ben noted at the time of the interview, this argument hasn’t yet really been fleshed out in detail. And as Rohin subsequently wrote in his opinion of the mesa-optimization paper, “it is not yet clear whether mesa optimizers will actually arise in practice”. Do you have thoughts on what exactly the “Argument for AI Risk from Mesa-Optimization” is, and/or a pointer to the places where, in your opinion, that argument has been made (aside from the original paper)?
2. I don’t entirely understand the remark about the reference class of ‘new intelligent species’. What species are in that reference class? Many species which we regard as quite intelligent (orangutans, octopuses, New Caledonian crows) aren’t risky. Probably, you mean a reference class like “new species as smart as humans” or “new ‘generally intelligent’ species”. But then we have a very small reference class and it’s hard to know how strong that prior should be. In any case, how were you thinking of this reference class argument?
3. ‘The Boss Baby’, starring Alec Baldwin, is available for rental on Amazon Prime Video for $3.99. I suppose this is more of a comment than a question.
1. Oh man, I wish. :( I do think there are some people working on making a crisper case, and hopefully as machine learning systems get more powerful we might even see early demonstrations. The crispest statement of it I can make is: “Similar to how humans now optimize for goals beyond the genetic fitness that evolution selects for, other systems that contain optimizers may start optimizing for goals other than the ones specified by the outer optimizer.” (There’s a toy sketch at the end of this point that makes the training/deployment gap concrete.)
Another related concept that I’ve seen (but haven’t followed up on) is what johnswentworth calls “Demons in Imperfect Search”, which argues for the possibility of runaway inner processes in a variety of imperfect search processes (not just ones that contain inner optimizers). This arguably happened with metabolic reactions early in the development of life, with greedy genes, and with managers in companies. Basically, I’m convinced that we don’t know enough about how powerful search processes work to be sure that we’re going to end up somewhere we want.
I should also say that I think these kinds of arguments feel like the best current cases for AI alignment risk. Even if AI systems end up perfectly aligned with human goals, I’m still quite worried about what the balance of power looks like in a world with lots of extremely powerful AIs running around.
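To make the inner-alignment worry concrete, here’s a minimal toy sketch. It’s my own construction, not from the podcast or the mesa-optimization paper, and it compresses the ‘inner optimizer’ down to a fixed policy, so strictly it illustrates proxy goals rather than mesa-optimization proper; the environment and all names are invented for illustration.

```python
import random

def run(policy, key, door, start=0, steps=12):
    """Simulate a 1-D gridworld; return True if the agent
    reaches the door (the base objective)."""
    pos = start
    for _ in range(steps):
        target = policy(key, door)
        if target > pos:
            pos += 1
        elif target < pos:
            pos -= 1
        if pos == door:
            return True
    return False

# Two candidate inner policies the outer search could select between.
aligned_policy = lambda key, door: door  # pursues the base objective
proxy_policy = lambda key, door: key     # pursues a correlated proxy

# "Training": every environment happens to place the key at the door,
# so the two policies are behaviorally indistinguishable and the outer
# search has no pressure to pick the aligned one.
train_envs = [(d, d) for d in random.sample(range(1, 10), 5)]
for name, policy in [("aligned", aligned_policy), ("proxy", proxy_policy)]:
    score = sum(run(policy, key, door) for key, door in train_envs)
    print(f"{name} policy, training score: {score}/{len(train_envs)}")

# "Deployment": key and door come apart, and the proxy policy fails
# the base objective despite perfect training behavior.
print("aligned policy at deployment:", run(aligned_policy, key=2, door=8))
print("proxy policy at deployment:  ", run(proxy_policy, key=2, door=8))
```

The point is just that the base objective can’t distinguish the two policies on the training distribution, so selecting on training performance gives no guarantee about which goal was actually learned.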
2. Yeah, I should have said ‘new species more intelligent than us’. I was thinking of two things:
- Humans causing the extinction of less intelligent species.
- Some folk intuition around intelligent aliens plausibly causing human extinction (I admit this isn’t the best example...).
Mostly I meant that since we don’t actually have examples of existentially risky technology (yet), putting AI in the reference class of ‘new technology’ might make you think it’s extremely implausible that it would be existentially bad. But we do have examples of species causing the extinction of less intelligent species (and scarier intuitions around it), so insofar as AI is a new, more intelligent species, we should think there’s at least some chance that it could be existentially bad.
3. Obviously not the same thing, but ‘The Boss Baby: Back in Business’, a spin-off of the original, not starring Alec Baldwin, is available on Netflix right now. I’ve watched about 20 seconds of it and feel comfortable saying that the money would be better spent on AI safety and governance work.