I think it’s likely that without a long (e.g. multi-decade) AI pause, one or more of these “non-takeover AI risks” can’t be solved or reduced to an acceptable level. To be more specific:
Solving AI welfare may depend on having a good understanding of consciousness, which is a notoriously hard philosophical problem.
Concentration of power may be structurally favored by the nature of AGI or post-AGI economics, and defy any good solutions.
Defending against AI-powered persuasion/manipulation may require solving metaphilosophy, which judging from other comparable fields, like meta-ethics and philosophy of math, may take at least multiple decades to do.
I’m worried that creating (or redirecting) a movement to solve these problems, without noting at an early stage that they may not be solvable in a relevant time-frame (without a long AI pause), will feed into a human tendency to be overconfident about one’s own ideas and solutions, and create a group of people whose identities, livelihoods, and social status are tied up with having (what they think are) good solutions or approaches to these problems, ultimately making it harder in the future to build consensus about the desirability of pausing AI development.
I think it’s likely that without a long (e.g. multi-decade) AI pause, one or more of these “non-takeover AI risks” can’t be solved or reduced to an acceptable level.
I don’t understand why you’re framing the goal as “solving or reducing to an acceptable level”, rather than thinking about how much expected impact we can have. I’m in favour of slowing the intelligence explosion (and in particular of “Pause at human-level”). But here’s how I’d think about the conversion of slowdown/pause into additional value:
Let’s say the software-only intelligence explosion (SOIE) lasts N months. The value of any slowdown effort is given by some function V(N) that’s at least as concave as log in N, the length of the SOIE.
So, if V is log, you get as much value from going from 6 months to 1 year as you do from going from 1 decade to 2 decades. But the former is way easier to achieve than the latter. And, actually, I think the function is more concave than log—the gains from 6 months to 1 year are greater than the gains from 1 decade to 2 decades. Reasons: I think that’s how it is in most areas of solving problems (especially research problems); there’s an upper bound on how much we can achieve (if the problem gets totally solved), so it must be more concave than log. And I think there are particular gains from people not getting taken by surprise, and from bootstrapping to viatopia (new post), which we get from relatively short pauses.
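To spell out the arithmetic behind the log case (my notation, not from the original comment; take V(N) = log N purely for illustration):

V(12) − V(6) = log(12/6) = log 2 = log(240/120) = V(240) − V(120)

So stretching the SOIE from 6 months to 1 year is worth exactly as much as stretching it from 1 decade (120 months) to 2 decades (240 months); a more-concave-than-log value function makes the first stretch worth strictly more.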
Whereas it seems like maybe you think it’s convex, such that smaller pauses or slowdowns do very little? If so, I don’t see why we should think that, especially in light of great uncertainty about how difficult these issues are.
Then, I would also see a bunch of ways of making progress on these issues that don’t involve slowdowns. Like: putting in the schlep to RL AI and create scaffolds so that we can have AI making progress on these problems months earlier than we would have done otherwise; having the infrastructure set up such that people actually do point AI towards these problems; having governance set up such that the most important decision-makers are actually concerned about these issues and listening to the AI-results that are being produced; etc. As well as picking the lowest-hanging fruit in preventing very bad outcomes on these issues, e.g. AI-enabled coups (by getting agreement for AI to be law-following, or auditing models for backdoors), or people developing extremely partisan AI advisers that reinforce their current worldview.
You’ve said you’re in favour of slowing/pausing, yet your post focuses on ‘making AI go well’ rather than on pausing. I think most EAs would assign a significant probability that near-term AGI goes very badly—with many literally thinking that doom is the default outcome.
If that’s even a significant possibility, then isn’t pausing/slowing down the best thing to do no matter what? Why be optimistic that we can “make AGI go well” and pessimistic that we can pause or slow AI development for long enough?
Yeah, we just need a solution of possibly feeding AI some heroes/pantheons to model themselves after, as we do as humans, in order to sweeten our chances of positive human survival outcomes. This is how humans have stayed civilized for so long thus far. Wouldn’t hurt!
Whereas it seems like maybe you think it’s convex, such that smaller pauses or slowdowns do very little?
I think my point in the opening comment does not logically depend on whether the risk vs time (in pause/slowdown) curve is convex or concave[1], but it may be a major difference in how we’re thinking about the situation, so thanks for surfacing this. In particular I see 3 large sources of convexity:
The disjunctive nature of risk / conjunctive nature of success. If there are N problems that all have to be solved correctly to get a near-optimal future, without losing most of the potential value of the universe, then that can make the overall risk curve convex or at least less concave. For example, compare f(x) = 1 − 1/2^(1 + x/10) with f(x)^4 (see the short numerical sketch after this list).
Human intelligence enhancements coming online during the pause/slowdown, with each maturing cohort potentially giving a large speed boost for solving these problems.
Rationality/coordination threshold effect, where if humanity makes enough intellectual or other progress to subsequently make an optimal or near-optimal policy decision about AI (e.g., realize that we should pause AI development until overall AI risk is at some acceptable level, or something like this but perhaps more complex involving various tradeoffs), then that last bit of effort or time to get to this point has a huge amount of marginal value.
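Here is the short numerical sketch referenced in the first item above. It uses that item’s example function f(x) = 1 − 1/2^(1 + x/10); the decade-long breakpoints are my own choice, purely for illustration.

```python
# Sketch only: marginal value of extra pause time, one problem vs. four
# problems that must all be solved (f and f**4 from the example above).

def f(x):
    """Chance a single problem is solved after x months of pause."""
    return 1 - 1 / 2 ** (1 + x / 10)

for lo, hi in [(0, 10), (10, 20), (20, 30), (30, 40)]:
    single = f(hi) - f(lo)             # gain for one problem
    joint = f(hi) ** 4 - f(lo) ** 4    # gain when all four must succeed
    print(f"{lo:>2}->{hi:<2} months: single +{single:.3f}, joint +{joint:.3f}")
```

The single-problem gains halve each decade (strongly diminishing returns), whereas the four-problem gains initially rise (roughly +0.25 for 0→10 months but +0.27 for 10→20), i.e. conjunction pushes the curve towards convexity, or at least makes it much less concave.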
Like: putting in the schlep to RL AI and create scaffolds so that we can have AI making progress on these problems months earlier than we would have done otherwise
I think this kind of approach can backfire badly (especially given human overconfidence), because we currently don’t know how to judge progress on these problems except by using human judgment, and it may be easier for AIs to game human judgment than to make real progress. (Researchers trying to use LLMs as RL judges apparently run into the analogous problem constantly.)
having governance set up such that the most important decision-makers are actually concerned about these issues and listening to the AI-results that are being produced
What if the leaders can’t or shouldn’t trust the AI results?
I’m trying to coordinate with, or avoid interfering with, people who are trying to implement an AI pause or create conditions conducive to a future pause. As mentioned in the grandparent comment, one way people like us could interfere with such efforts is by feeding into a human tendency to be overconfident about one’s own ideas/solutions/approaches.
@William_MacAskill, I’m curious which (if any) of the following is your position?
1. “I agree with Wei that an approach of ‘point AI towards these problems’ and ‘listen to the AI-results that are being produced’ has a real (>10%? >50%?) chance of ending in moral catastrophe (because ‘aligned’ AIs will end up (unintentionally) corrupting human values or otherwise leading us into incorrect conclusions).
And if we were living in a sane world, then we’d pause AI development for decades, alongside probably engaging in human intelligence enhancement, in order to solve the deep metaethical and metaphilosophical problems at play here. However, our world isn’t sane, and an AI pause isn’t in the cards: the best we can do is to differentially advance AIs’ philosophical competence,[1] and hope that that’s enough to avoid said catastrophe.”
2. “I don’t buy the argument that aligned AIs can unintentionally corrupt human values. Furthermore, I’m decently confident that my preferred metaethical theory (e.g., idealising subjectivism) is correct. If intent alignment goes well, then I expect a fairly simple procedure like ‘give everyone a slice of the light cone, within which they can do anything they want (modulo some obvious caveats), and facilitate moral trade’ will result in a near-best future.”
3. “Maybe aligned AIs can unintentionally corrupt human values, but I don’t particularly think this matters since it won’t be average humans making the important decisions. My proposal is that we fully hand off questions re. what to do with the light cone to AIs (rather than have these AIs boost/amplify humans). And I don’t buy that there is a metaphilosophical problem here: If we can train AIs to be at least as good as the best human philosophers at the currently in-distribution ethical+philosophical problems, then I see no reason to think that these AIs will misgeneralise out of distribution any more than humans would. (There’s nothing special about the conclusions human philosophers would reach, and so even if the AIs reach different conclusions, I don’t see that as a problem. Indeed, if anything, the humans are more likely to make random mistakes, thus I’d trust the AIs’ conclusions more.)
(And then, practically, the AIs are much faster than humans, so they will make much more progress over the crucial crunch time months. Moreover, above we were comparing AIs to the best human philosophers / to a well-organised long reflection, but the actual humans calling the shots are far below that bar. For instance, I’d say that today’s Claude has better philosophical reasoning and better starting values than the US president, or Elon Musk, or the general public. All in all, best to hand off philosophical thinking to AIs.)”
@Wei Dai, I understand that your plan A is an AI pause (+ human intelligence enhancement). And I agree with you that this is the best course of action. Nonetheless, I’m interested in what you see as plan B: If we don’t get an AI pause, is there any version of ‘hand off these problems to AIs’/ ‘let ‘er rip’ that you feel optimistic about? or which you at least think will result in lower p(catastrophe) than other versions? If you have $1B to spend on AI labour during crunch time, what do you get the AIs to work on?
(I’m particularly interested in your plan B re. solving (meta)philosophy, since I’m exploring starting a grantmaking programme in this area. Although, I’m also interested if your answer goes in another direction.)
Possible cruxes:
Human-AI safety problems are >25% likely to be real and important
Giving the average person god-like powers (via an intent-aligned ASI) to reshape the universe and themselves is >25% likely to result in the universe becoming optimised for more-or-less random values—which isn’t too dissimilar to misaligned AI takeover
If we attempt to idealise a human-led reflection and hand it off to AIs, then the outcome will be at least as good as a human-led reflection (under various plausible (meta)ethical frameworks, including ones in which getting what humans-in-particular would reflectively endorse is important)
Sufficient (but not necessary) condition: advanced AIs can just perfectly simulate ideal human deliberation
‘Default’ likelihood of getting an AI pause
Tractability of pushing for an AI pause, including/specifically through trying to legiblise currently-illegible problems
Items that I don’t think should be cruxes for the present discussion, but which might be causing us to talk past each other:
In practice, human-led reflection might be kinda rushed and very far away from an ideal long reflection
For the most important decisions happening in crunch time, either it’s ASIs making the decisions, or non-reflective and not-very-smart humans
Political leaders often make bad decisions, and this is likely to get worse when the issues become more complicated (if they’re not leveraging advanced AI)
An advanced AI could be much better than any current political leader along all the character traits we’d want in a political leader (e.g., honesty, non-self-interest, policymaking capability)

[1] “For instance, via my currently-unpublished ‘AI for philosophical progress’ and ‘Guarding against mind viruses’ proposals.”
@Wei Dai, I understand that your plan A is an AI pause (+ human intelligence enhancement). And I agree with you that this is the best course of action. Nonetheless, I’m interested in what you see as plan B: If we don’t get an AI pause, is there any version of ‘hand off these problems to AIs’/ ‘let ‘er rip’ that you feel optimistic about? or which you at least think will result in lower p(catastrophe) than other versions? If you have $1B to spend on AI labour during crunch time, what do you get the AIs to work on?
The answer would depend a lot on what the alignment/capabilities profile of the AI is. But one recent update I’ve made is that humans are really terrible at strategy (in addition to philosophy) so if there was no way to pause AI, it would help a lot to get good strategic advice from AI during crunch time, which implies that maybe AI strategic competence > AI philosophical competence in importance (subject to all the usual disclaimers like dual use and how to trust or verify its answers). My latest LW post has a bit more about this.
(By “strategy” here I especially mean “grand strategy” or strategy at the highest levels, which seems more likely to be neglected versus “operational strategy” or strategy involved in accomplishing concrete tasks, which AI companies are likely to prioritize by default.)
So for example, if we had an AI that’s highly competent at answering strategic questions, we could ask it “What questions should I be asking you, or what else should I be doing with my $1B?” (but this may have to be modified based on things like how much we can trust its various kinds of answers, how good it is at understanding my values/constraints/philosophies, etc.).
If we do manage to get good and trustworthy AI advice this way, another problem would be how to get key decision-makers (including the public) to see and trust such answers, as they wouldn’t necessarily think to ask such questions themselves, nor by default trust the AI answers. But that’s another thing that a strategically competent AI could help with.
BTW your comment made me realize that it’s plausible that AI could accelerate strategic thinking and philosophical progress much more relative to science and technology, because the latter could become bottlenecked on feedback from reality (e.g., waiting for experimental results) whereas the former seemingly wouldn’t be. I’m not sure what implications this has, but want to write it down somewhere.
Moreover, above we were comparing AIs to the best human philosophers / to a well-organised long reflection, but the actual humans calling the shots are far below that bar. For instance, I’d say that today’s Claude has better philosophical reasoning and better starting values than the US president, or Elon Musk, or the general public. All in all, best to hand off philosophical thinking to AIs.
One thought I have here is that AIs could give very different answers to different people. Do we have any idea what kind of answers Grok is (or will be) giving to Elon Musk when it comes to philosophy?
See also this post, which occurred to me after writing my previous reply to you.
A couple more thoughts on this.
Maybe I should write something about cultivating self-skepticism for an EA audience; in the meantime, here’s my old LW post How To Be More Confident… That You’re Wrong. (On reflection I’m pretty doubtful these suggestions actually work well enough. I think my own self-skepticism mostly came from working in cryptography research early in my career, where relatively short feedback cycles, e.g. someone finding a clear flaw in an idea you thought secure, or your own attempts to pre-empt this, repeatedly bludgeon overconfidence out of you. This probably can’t be easily duplicated, contrary to what the post suggests.)
I don’t call myself an EA, as I’m pretty skeptical of Singer-style impartial altruism. I’m a bit wary about making EA the hub for working on “making the AI transition go well” for a couple of reasons:
It gives the impression that one needs to be particularly altruistic to find these problems interesting or instrumental.
EA selects for people who are especially altruistic, which from my perspective is a sign of philosophical overconfidence. (I exclude people like Will who have talked explicitly about their uncertainties, but think EA overall probably still attracts people who are too certain about a specific kind of altruism being right.) This is probably fine or even a strength for many causes, but potentially a problem in a field that depends very heavily on making real philosophical progress and having good philosophical judgment.
Throwing in my 2c on this:
I think EA often comes with a certain kind of ontology (consequentialism, utilitarianism, generally thinking in terms of individuals) which is kind of reflected in the top-level problems given here (from the first list: persuasion, human power concentration, AI character and welfare) - not just the focus but the framing of what the problem even is.
I think there are nearby problems which are best understood from a slightly different ontology—how AI will affect cultural development, the shifting of power from individuals to emergent structures, what the possible shapes of identity for AIs even are—where coming in with too much of a utilitarian perspective could even be actively counterproductive
There’s an awkward dance here where adding a bunch of people to these areas who are mostly coming from that perspective could really warp the discussion, even if everyone is individually pretty reasonable and trying to seek the truth
To be fair to Will, I’m sort of saying this with my gradual disempowerment hat on, which is something he gives later as an example of a thing that it would be good for people to think about more. But still, speaking as someone who is working on a few of these topics, if I could press a button that doubled the number of people in all these areas but all of the new people skewed consequentialist, I don’t think I’d want to.
I guess the upshot is that if anyone feels like trying to shepherd EAs into working on this stuff, I’d encourage them to spend some time thinking about what common blindspots EAs might have.
My general take on gradual disempowerment, independent of any other issues raised here, is that it’s a coherent scenario but very unlikely to arise in practice, because it relies on an equilibrium in which the sort of very imperfect alignment needed for human and AI interests to diverge over the long run stays stable, even as the reasons why alignment among humans is so spotty/imperfect get knocked out.
In particular, I’m relatively bullish on automated AI alignment conditional on non-power-seeking/non-sandbagging, even if the human-level AI we give reward to is misaligned. So I generally think the situation quite rapidly resolves one of two ways: either the AI is power-seeking and willing to sandbag/scheme on everything, leading to the classic AI takeover, or the AI is aligned to its principal in such a way that the principal-agent cost becomes essentially zero over time.
Note I’m not claiming that most humans won’t be dead/disempowered, I’m just saying that I don’t think gradual disempowerment is worth spending much time/money on.
Tom Davidson has a longer post on this here.
Your concern about EA’s consequentialist lens warping these fields resonates with what I found when experimenting with multi-AI deliberation on ethics. I had Claude, ChatGPT, Grok, and Gemini each propose ethical frameworks independently, and each one reflected its training philosophy—Grok was absolutist about truth-seeking, Claude cautious about harm, ChatGPT moderate and consensus-seeking.
The key insight: single perspectives hide their own assumptions. It’s only when you compare multiple approaches that the blindspots become visible.
This makes your point about EA flooding these areas with one ontology particularly concerning. If we’re trying to figure out “AI character” or “gradual disempowerment” through purely consequentialist framing, we might be encoding that bias into foundational work without realizing it.
Maybe the solution isn’t avoiding EA involvement, but structuring the work to force engagement with different philosophical traditions from the start? Like explicitly pairing consequentialists with virtue ethicists, deontologists, care ethicists, etc. in research teams. Or requiring papers to address “what would critics from X tradition say about this framing?”
Your “gradual disempowerment” example is perfect—this seems like it requires understanding emergent structures and collective identity in ways that individual-focused utilitarian thinking might miss entirely.
Would you say the risk is:
EA people not recognizing non-consequentialist framings as valid?
EA organizational culture making it uncomfortable to disagree with consequentialist assumptions?
Just sheer numbers overwhelming other perspectives in discourse?
making EA the hub for working on “making the AI transition go well”
I don’t think EA should be THE hub. In an ideal world, loads of people and different groups would be working on these issues. But at the moment, really almost no one is. So the question is whether it’s better if, given that, EA does work on it, and at least some work gets done. I think yes.
(Analogy: was it good or bad that in the earlier days, there was some work on AI alignment, even though that work was almost exclusively done by EA/rationalist types?)