I’m a philosopher who’s switched to working on AI safety full-time. I also know there are at least a few philosophers at Anthropic working on alignment.
With regard to your “Problems in AI Alignment that philosophers could potentially contribute to”:
I agree that many of these questions are important and that more people should work on them.
But a fair number of them are already discussed in conventional academic philosophy, e.g.:
How to resolve standard debates in decision theory?
Infinite/multiversal/astronomical ethics
Fair distribution of benefits
What is the nature of philosophy?
What constitutes correct philosophical reasoning?
How should an AI aggregate preferences between its users?
What is the nature of normativity?
And these are all difficult, controversial questions.
For each question, you have to read and deeply think about at least 10 papers (and likely many more) to get a good understanding of the question and its current array of candidate answers.
Any attempt to resolve the question would have to grapple with a large number of considerations and points that have previously been made in relation to the question.
Probably, you need to write something at least book-length.
(And it’s very hard to get people to read book-length things.)
In trying to do this, you probably won’t find any answer that you’re really confident in.
I think most philosophers’ view on the questions they study is: ‘It’s really hard. Here’s my best guess.’
Or if they’re confident of something, it’ll be a small point within existing debates (e.g. ‘This particular variant of this view is subject to this fatal objection.’).
And even if you do find an answer you’re confident in, you’ll have a very hard time convincing other philosophers of that answer.
They’ll bring up some point that you hadn’t thought of.
Or they’ll differ from you in their bedrock intuitions, and it’ll be hard for either of you to see any way to argue the other out of their bedrock intuition.
In some cases—like population ethics and decision theory—we have proofs that every possible answer will have some unsavory implication. You have to pick your poison, and different philosophers will make different picks.
And on inductive grounds, I suspect that many other philosophical questions also have no poison-free answers.
Derek Parfit is a good example here.
He spent decades working on On What Matters, trying to settle the questions of ethics and meta-ethics.
He really tried to get other philosophers to agree with him.
But very few do. The general consensus in philosophy is that it’s not a very convincing book.
And I think a large part of the problem is a difference of bedrock intuitions. For example, Bernard Williams simply ‘didn’t have the concept of a normative reason,’ and there was nothing that Parfit could do to change that.
It also seems like there’s not much of an appetite among AI researchers for this kind of work.
If there were, we might see more discussions of On What Matters, or any of the other existing works on these questions.
When I decided to start working on AI, I seriously considered working on the kinds of questions you list. But due to the points above, I chose to do my current work instead.
This reads to me like you’re saying “these problems are hard [so Wei Dai is over-rating the importance of working on them]”, whereas the inference I would make is “these problems are hard, so we need to slow down AI development, otherwise we won’t be able to solve them in time.”
I didn’t mean to imply that Wei Dai was overrating the problems’ importance. I agree they’re very important! I was making the case that they’re also very intractable.
If I thought solving these problems pre-TAI would be a big increase to the EV of the future, I’d take their difficulty to be a(nother) reason to slow down AI development. But I think I’m more optimistic than you and Wei Dai about waiting until we have smart AIs to help us on these problems.
Do you want to talk about why you’re relatively optimistic? I’ve tried to explain my own concerns/pessimism at https://www.lesswrong.com/posts/EByDsY9S3EDhhfFzC/some-thoughts-on-metaphilosophy and https://forum.effectivealtruism.org/posts/axSfJXriBWEixsHGR/ai-doing-philosophy-ai-generating-hands.
I said a little in another thread. If we get aligned AI, I think it’ll likely be a corrigible assistant that doesn’t have its own philosophical views that it wants to act on. And then we can use these assistants to help us solve philosophical problems. I’m imagining in particular that these AIs could be very good at mapping logical space, tracing all the implications of various views, etc. So you could ask a question and receive a response like: ‘Here are the different views on this question. Here’s why they’re mutually exclusive and jointly exhaustive. Here are all the most serious objections to each view. Here are all the responses to those objections. Here are all the objections to those responses,’ and so on. That would be a huge boost to philosophical progress. Progress has been slow so far because human philosophers take entire lifetimes just to fill in one small part of this enormous map, because humans make errors, which means later philosophers can’t even trust that small filled-in part, and because verification in philosophy isn’t much quicker than generation.
The argument tree (arguments, counterarguments, counter-counterarguments, and so on) is exponentially sized, and we don’t know how deep or wide we need to expand it before a given problem can be solved. We do know that different humans looking at the same partial tree (i.e., philosophers who have read the same literature on some problem) can reach very different judgments as to what the correct conclusion is. There’s also a huge amount of intuition/judgment involved in choosing which part of the tree to focus on or expand further. With AIs helping to expand the tree for us, there are potential advantages like the ones you mentioned, but also potential disadvantages, like AIs not having good intuition/judgment about which lines of argument to pursue, or the argument tree (or AI-generated philosophical literature) becoming too large for any human to read and think about in a relevant time frame. Many will be very tempted to just let AIs answer the questions / draw the final conclusions for us, especially if AIs also accelerate technological progress, creating many urgent philosophical problems about how to use them safely and beneficially. And even if humans do try to draw the conclusions themselves, they can easily get them wrong despite AI help with expanding the argument tree.
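To make the combinatorics a little more concrete, here is a minimal toy sketch (my own illustration, not anything from my original list or posts; `generate_objections` and `priority` are hypothetical stand-ins for an AI assistant proposing counterarguments and for the intuition/judgment about which branches are worth pursuing):

```python
from dataclasses import dataclass, field

@dataclass
class ArgumentNode:
    """One claim in the tree; children are objections/responses to it."""
    claim: str
    children: list["ArgumentNode"] = field(default_factory=list)

def expand_tree(root, generate_objections, priority, budget):
    """Expand the most 'promising' nodes first, until the budget runs out.

    generate_objections(claim) -> list[str] stands in for an AI assistant
    proposing counterarguments; priority(node) -> float stands in for the
    human/AI judgment about which lines of argument to pursue.
    Both are hypothetical interfaces, not real APIs.
    """
    frontier = [root]
    while frontier and budget > 0:
        frontier.sort(key=priority, reverse=True)  # the judgment call: what to expand next
        node = frontier.pop(0)
        for objection in generate_objections(node.claim):
            child = ArgumentNode(claim=objection)
            node.children.append(child)
            frontier.append(child)
            budget -= 1
            if budget <= 0:
                break
    return root

# Even a modest branching factor b and depth d give roughly b**d nodes;
# e.g. 5**10 is about ten million, far more than any budget of human reading
# time, which is why the choice of what to expand matters so much.
```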
So I think undergoing the AI transition without either solving metaphilosophy or making AIs autonomously competent at philosophy (i.e., good at reaching correct conclusions by themselves) is enormously risky, even if we have corrigible AIs helping us.
I wrote a post that I think was partly inspired by this discussion. Its implication here is that I don’t necessarily want philosophers to directly try to solve the many hard philosophical problems relevant to AI alignment/safety (especially given how few of them are in this space or concerned about x-safety), but initially just to try to make these problems “more legible” to others, including AI researchers, key decision makers, and the public. Hopefully you agree that this is a more sensible position.
Yes, I agree this is valuable, though I think it’s valuable mainly because it increases the probability that people use future AIs to solve these problems, rather than because it will make people slow down AI development or try very hard to solve them pre-TAI.
I’m not sure, but I think I may also have a different view from yours on which problems are going to be bottlenecks to AI development. For example, I think there’s a big chance that the world would steam ahead even if we don’t solve any of the current (non-philosophical) problems in alignment (interpretability, shutdownability, reward hacking, etc.).
I agree that many of the problems on my list are very hard and probably not the highest-marginal-value work to be doing from an individual perspective. Keep in mind that the list was written 6 years ago, when it was less clear when the AI takeoff would start in earnest, or how many philosophers would become motivated to work on AI safety once AGI became visibly closer. I still had some hope that, when the time came, a significant fraction of all philosophers would become self-motivated or would be “called to arms” by a civilization-wide AI safety effort, and would be given sufficient resources, including time. So the list was trying to be comprehensive (listing every philosophical problem that I thought relevant to AI safety) rather than prioritized. Unfortunately, reality has turned out to be nearly the complete opposite of this.
Currently, one of my main puzzles is why philosophers with public AI x-risk estimates still give numbers in the 10% range, despite reality being near the most pessimistic end of my range of expectations: it looks like the AI takeoff/transition will occur while most of these philosophical problems remain wide open or in a totally confused state, and AI researchers seem almost completely oblivious to this or unconcerned about it. Why are they not making the same kind of argument that I’ve been making, that philosophical difficulty is a reason AI alignment/x-safety is harder than many think, and an additional reason to pause/stop AI?
I don’t think philosophical difficulty adds that much to the difficulty of alignment, mainly because I think AI developers should (and likely will) aim to make AIs corrigible assistants rather than agents with their own philosophical views that they try to impose on the world. And I think it’s fairly likely that we can use these assistants (if we succeed in getting them and aren’t disempowered by a misaligned AI instead) to help a lot with these hard philosophical questions.