Wei Dai
Problems I’ve Tried to Legibilize
The argument tree (arguments, counterarguments, counter-counterarguments, and so on) is exponentially sized, and we don’t know how deep or wide we need to expand it before some problem can be solved. We do know that different humans looking at the same partial tree (i.e., philosophers who have read the same literature on some problem) can have very different judgments as to what the correct conclusion is. There’s also a huge amount of intuition/judgment involved in choosing which part of the tree to focus on or expand further. With AIs helping to expand the tree for us, there are potential advantages like you mentioned, but also potential disadvantages, like AIs not having good intuition/judgment about which lines of argument to pursue, or the argument tree (or AI-generated philosophical literature) becoming too large for any human to read and think about in a relevant time frame. Many will be very tempted to just let AIs answer the questions / make the final conclusions for us, especially if AIs also accelerate technological progress, creating many urgent philosophical problems related to how to use them safely and beneficially. And if humans do try to make the final conclusions themselves, they can easily get them wrong despite AI help with expanding the argument tree.
So I think undergoing the AI transition without either solving metaphilosophy or making AIs autonomously competent at philosophy (good at reaching correct conclusions by themselves) is enormously risky, even if we have corrigible AIs helping us.
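(As a rough, purely illustrative sketch of the scale problem with the argument tree mentioned above: the branching factor, depth range, and reading rate below are arbitrary assumptions, not estimates, but they show how quickly a fully expanded tree outgrows what any human could read in a relevant time frame.)

```python
# Rough illustration: size of a complete argument tree with branching factor b
# (counterarguments per argument) expanded to depth d, versus how many nodes
# one person could read at, say, 100 nodes per day for a decade.
# The numbers b, d, and the reading rate are arbitrary assumptions.

b, reading_budget = 5, 100 * 365 * 10  # ~365,000 nodes readable in 10 years

for d in range(1, 11):
    nodes = sum(b ** k for k in range(d + 1))  # nodes in a complete b-ary tree of depth d
    note = "  (exceeds a decade of reading)" if nodes > reading_budget else ""
    print(f"depth {d:>2}: {nodes:>12,} nodes{note}")
```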
Do you want to talk about why you’re relatively optimistic? I’ve tried to explain my own concerns/pessimism at https://www.lesswrong.com/posts/EByDsY9S3EDhhfFzC/some-thoughts-on-metaphilosophy and https://forum.effectivealtruism.org/posts/axSfJXriBWEixsHGR/ai-doing-philosophy-ai-generating-hands.
Thanks! I hope this means you’ll spend some more time on this type of work, and/or tell other philosophers about this argument. It seems apparent that we need more philosophers to work on philosophical problems related to AI x-safety (many of which do not seem to be legible to most non-philosophers). Not necessarily by attacking them directly (this is very hard and probably not the best use of time, as we previously discussed), but instead by making them more legible to AI researchers, decision-makers, and the general public.
In my view, there isn’t much desire for work like this from people in the field and they probably wouldn’t use it to inform deployment unless a lot of effort is also added from the author to meet the right people, convince them to spend the time to take it seriously etc.
Any thoughts on Legible vs. Illegible AI Safety Problems, which is in part a response to this?
I wrote a post that I think was partly inspired by this discussion. The implication of it here is that I don’t necessarily want philosophers to directly try to solve the many hard philosophical problems relevant to AI alignment/safety (especially given how few of them are in this space or concerned about x-safety), but initially just to try to make them “more legible” to others, including AI researchers, key decision makers, and the public. Hopefully you agree that this is a more sensible position.
Legible vs. Illegible AI Safety Problems
I agree that many of the problems on my list are very hard and probably not the highest-marginal-value work to be doing from an individual perspective. Keep in mind that the list was written 6 years ago, when it was less clear when the AI takeoff would start in earnest, or how many philosophers would become motivated to work on AI safety once AGI became visibly closer. I still had some hope that when the time came, a significant fraction of all philosophers would become self-motivated or would be “called to arms” by a civilization-wide AI safety effort, and would be given sufficient resources including time, so the list was trying to be comprehensive (listing every philosophical problem that I thought relevant to AI safety) rather than prioritizing. Unfortunately, the reality is nearly the complete opposite of this.
Currently, one of my main puzzles is why philosophers with public AI x-risk estimates still have numbers in the 10% range, despite reality being near the most pessimistic end of my range of expectations: it looks like the AI takeoff/transition will occur while most of these philosophical problems remain wide open or in a totally confused state, and AI researchers seem almost completely oblivious to or uncaring about this. Why are they not making the same kind of argument that I’ve been making, that philosophical difficulty is a reason that AI alignment/x-safety is harder than many think, and an additional reason to pause/stop AI?
Right, I know about Will MacAskill, Joe Carlsmith, and your work in this area, but none of you are working on alignment per se full time or even close to full time AFAIK, and the total effort is clearly far from adequate to the task at hand.
I think some have given up philosophy to work on other things such as AI alignment.
Any other names you can cite?
In my view, there isn’t much desire for work like this from people in the field and they probably wouldn’t use it to inform deployment unless a lot of effort is also added from the author to meet the right people, convince them to spend the time to take it seriously etc.
Thanks, this makes sense to me, and my follow-up is how concerning do you think this situation is?
One perspective I have is that at this point, several years into a potential AI takeoff, with AI companies now worth trillions in aggregate, alignment teams at AI companies still have virtually no professional philosophical oversight (or outside consultants that they rely on), and are kind of winging it based on their own philosophical beliefs/knowledge. It seems rather like trying to build a particle collider or fusion reactor with no physicists on staff, only engineers.
(Or worse: unlike engineers and their physics knowledge, I doubt that a systematic education in fields like ethics and metaethics is even a hard requirement for working as an alignment researcher. And worse still, unlike the situation in physics, we don’t even have settled ethics/metaethics/metaphilosophy/etc. that alignment researchers could just learn and apply.)
Maybe the AI companies are reluctant to get professional philosophers involved, because in the fields that do have “professional philosophical oversight”, e.g., bioethics, things haven’t worked out that well. (E.g. human challenge trials being banned during COVID.) But to me, this would be a signal to yell loudly that our civilization is far from ready to attempt or undergo an AI transition, rather than a license to wing it based on one’s own philosophical beliefs/knowledge.
As an outsider, the situation seems crazily alarming to me, and I’m confused that nobody else is talking about it, including philosophers like you who are in the same overall space and looking at roughly the same things. I wonder if you have a perspective that makes the situation not quite as alarming as it appears to me.
Do you have any insights into why there are so few philosophers working in AI alignment, or closely with alignment researchers? (Amanda Askell is the only one I know.) Do you think this is actually a reasonable state of affairs (i.e., it’s right or fine that almost no professional philosophers work directly as or with alignment researchers), or is this wrong/suboptimal, caused by some kind of cultural or structural problem? It’s been 6 years since I wrote Problems in AI Alignment that philosophers could potentially contribute to and I’ve gotten a few comments from philosophers saying they found the list helpful or that they’ll think about working on some of the problems, but I’m not aware of any concrete follow-ups.
If it is some kind of cultural or structural problem, it might be even higher leverage to work on solving that, instead of object-level philosophical problems. I’d try to do this myself, but as an outsider to academic philosophy, and also very far from any organizations that might potentially hire philosophers to work on AI alignment, it’s hard for me to even observe what the problem might be.
Interesting re belief in hell being a key factor, I wasn’t thinking about that.
It seems like the whole AI x-risk community has latched onto “align AI with human values/intent” as the solution, with few people thinking even a few steps ahead to “what if we succeeded”? I have a post related to this if you’re interested.
possibly the future economy will be so much more complicated that it will still make sense to have some distributed information processing in the market rather than have all optimisation centrally planned
I think there will be distributed information processing, but each distributed node/agent will be a copy of the central AGI (or otherwise aligned with it or sharing its values), because this is what’s most economically efficient, minimizes waste from misaligned incentives, and so on. So there won’t be the kind of value pluralism that we see today.
I assume we won’t be able to know with high confidence in advance what economic model will be most efficient post-ASI.
There are probably a lot of other surprises that we can’t foresee today. I’m mostly claiming that post-AGI economics and governance probably won’t look very similar to today’s.
Why do you think this work has less value than solving philosophical problems in AI safety?
From the perspective of comparative advantage and counterfactual impact, this work does not seem to require philosophical training. It seems to be straightforward empirical research that many people could do, besides the very few professionally trained, AI-risk-concerned philosophers that humanity has.
To put it another way, I’m not sure that Toby was wrong to work on this, but if he was, it’s because if he hadn’t, then someone else with more comparative advantage for working on this problem (due to lacking training or talent for philosophy) would have done so shortly afterwards.
While I appreciate this work being done, it seems a very bad sign for our world/timeline that the very few people with both philosophy training and an interest in AI x-safety are using their time/talent to do forecasting (or other) work instead of solving philosophical problems in AI x-safety, with Daniel Kokotajlo being another prominent example.
This implies one of two things: Either they are miscalculating the best way to spend their time, which indicates bad reasoning or intuitions even among humanity’s top philosophers (i.e., those who have at least realized the importance of AI x-risk and are trying to do something about it). Or they actually are the best people (in a comparative advantage sense) available to work on these other problems, in which case the world must be on fire, and they’re having to delay working on extremely urgent problems that they were trained for, to put out even bigger fires.
(Cross-posted to LW and EAF.)
The ethical schools of thought I’m most aligned with—longtermism, sentientism, effective altruism, and utilitarianism—are far more prominent in the West (though still very niche).
I want to point out that the ethical schools of thought that you’re (probably) most anti-aligned with (e.g., that certain behaviors and even thoughts are deserving of eternal divine punishment) are also far more prominent in the West, proportionately even more so than the ones you’re aligned with.
Also the Western model of governance may not last into the post-AGI era regardless of where the transition starts. Aside from the concentration risk mentioned in the linked post, driven by post-AGI economics, I think different sub-cultures in the West breaking off into AI-powered autarkies or space colonies with vast computing power, governed by their own rules, is also a very scary possibility.
I’m pretty torn and may actually slightly prefer a CCP-dominated AI future (despite my family’s history with the CCP). But more importantly, I think both possibilities are incredibly risky if the AI transition occurs in the near future.
Whereas it seems like maybe you think it’s convex, such that smaller pauses or slowdowns do very little?
I think my point in the opening comment does not logically depend on whether the risk vs time (in pause/slowdown) curve is convex or concave[1], but it may be a major difference in how we’re thinking about the situation, so thanks for surfacing this. In particular I see 3 large sources of convexity:
The disjunctive nature of risk / conjunctive nature of success. If there are N problems that all have to be solved correctly to get a near-optimal future, without losing most of the potential value of the universe, then that can make the overall risk curve convex, or at least less concave. For example, compare f(x) = 1 − 1/2^(1 + x/10) with f(x)^4 (see the numerical sketch after this list).
Human intelligence enhancements coming online during the pause/slowdown, with each maturing cohort potentially giving a large speed boost for solving these problems.
Rationality/coordination threshold effect, where if humanity makes enough intellectual or other progress to subsequently make an optimal or near-optimal policy decision about AI (e.g., realize that we should pause AI development until overall AI risk is at some acceptable level, or something like this but perhaps more complex involving various tradeoffs), then that last bit of effort or time to get to this point has a huge amount of marginal value.
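(Here is a minimal numerical sketch of the first point above. The curve f and the choice of N = 4 are just the illustrative example from that bullet, and x is an assumed pause length in arbitrary units, not a model of actual risk. The point is only that f is concave everywhere while f^4 starts out convex, i.e., when success is conjunctive, short pauses buy relatively little.)

```python
# Compare the curvature of a single-problem success curve f(x) with the
# conjunctive curve f(x)**4 (all of N = 4 problems solved), using a discrete
# second difference as a proxy for the second derivative.
# f, N, and the sampled x values are illustrative assumptions only.

def f(x):
    # Per-problem probability of success given a pause of length x
    return 1 - 1 / 2 ** (1 + x / 10)

def second_difference(g, x, h=1.0):
    # Positive => locally convex, negative => locally concave
    return g(x + h) - 2 * g(x) + g(x - h)

for x in [2, 5, 8, 15]:
    single = second_difference(f, x)
    conjunctive = second_difference(lambda t: f(t) ** 4, x)
    print(f"x={x:>2}: f''~{single:+.5f}   (f^4)''~{conjunctive:+.5f}")
```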
Like: putting in the schlep to RL AI and create scaffolds so that we can have AI making progress on these problems months earlier than we would have done otherwise
I think this kind of approach can backfire badly (especially given human overconfidence), because we currently don’t know how to judge progress on these problems except by using human judgment, and it may be easier for AIs to game human judgment than to make real progress. (Researchers trying to use LLMs as RL judges apparently run into the analogous problem constantly.)
having governance set up such that the most important decision-makers are actually concerned about these issues and listening to the AI-results that are being produced
What if the leaders can’t or shouldn’t trust the AI results?
[1] I’m trying to coordinate with, or avoid interfering with, people who are trying to implement an AI pause or create conditions conducive to a future pause. As mentioned in the grandparent comment, one way people like us could interfere with such efforts is by feeding into a human tendency to be overconfident about one’s own ideas/solutions/approaches.
A couple more thoughts on this.
Maybe I should write something about cultivating self-skepticism for an EA audience; in the meantime, here’s my old LW post How To Be More Confident… That You’re Wrong. (On reflection, I’m pretty doubtful these suggestions actually work well enough. I think my own self-skepticism mostly came from working in cryptography research early in my career, where relatively short feedback cycles, e.g. someone finding a clear flaw in an idea you thought secure, or your own attempts to pre-empt this, repeatedly bludgeon overconfidence out of you. This probably can’t be easily duplicated, contrary to what the post suggests.)
I don’t call myself an EA, as I’m pretty skeptical of Singer-style impartial altruism. I’m a bit wary about making EA the hub for working on “making the AI transition go well” for a couple of reasons:
It gives the impression that one needs to be particularly altruistic to find these problems interesting or instrumental.
EA selects for people who are especially altruistic, which from my perspective is a sign of philosophical overconfidence. (I exclude people like Will who have talked explicitly about their uncertainties, but think EA overall probably still attracts people who are too certain about a specific kind of altruism being right.) This is probably fine or even a strength for many causes, but potentially a problem in a field that depends very heavily on making real philosophical progress and having good philosophical judgment.
I think it’s likely that without a long (e.g. multi-decade) AI pause, one or more of these “non-takeover AI risks” can’t be solved or reduced to an acceptable level. To be more specific:
Solving AI welfare may depend on having a good understanding of consciousness, which is a notoriously hard philosophical problem.
Concentration of power may be structurally favored by the nature of AGI or post-AGI economics, and defy any good solutions.
Defending against AI-powered persuasion/manipulation may require solving metaphilosophy, which judging from other comparable fields, like meta-ethics and philosophy of math, may take at least multiple decades to do.
I’m worried that creating (or redirecting) a movement to solve these problems, without noting at an early stage that they may not be solvable in a relevant time frame (without a long AI pause), will feed into a human tendency to be overconfident about one’s own ideas and solutions, and create a group of people whose identities, livelihoods, and social status are tied up with having (what they think are) good solutions or approaches to these problems, ultimately making it harder in the future to build consensus about the desirability of pausing AI development.
Perhaps the most important question is whether you support a restriction on space colonization (completely, or to a few nearby planets) during the Long Reflection. Unrestricted colonization seems good from a pure pro-natalist perspective, but bad from an optionalist perspective, as it makes it much more likely that, if anti-natalism (or adjacent positions, like that there should be strict care or controls over what lives can be brought into existence) is right, some of the colonies will fail to reach the correct conclusion and go on to colonize the universe in an unrestricted way, thus making humanity as a whole unable to implement the correct option.
If you do support such a restriction, then I think we agree on “the highest order bits” or the most important policy implication of optionalism, but probably still disagree on what is the best population size during the Long Reflection, which may be unresolvable due to our differing intuitions. I think I probably have more sympathy for anti-natalist intuitions than you do (in particular that most current lives may have negative value and people are mistaken about this), and worry more that creating negative-value lives and/or bringing lives into existence without adequate care could constitute a kind of irreversible or irreparable moral error. Unfortunately I do not see a good way to resolve such disagreements at our current stage of philosophical progress.
I think both natalism and anti-natalism risk committing moral atrocities, if their opposite position turns out to be correct. Natalism, if either people are often mistaken about their lives being worth living (cf. the Deluded Gladness Argument), or bringing people into existence requires much more due diligence about understanding/predicting their specific well-informed preferences (perhaps more understanding than current science and philosophy allow). Anti-natalism, if human extinction implies losing an astronomically large amount of potential value (cf. Astronomical Waste).
My own position, which might be called “min-natalism” or “optionalism”, is that we should ideally aim for a minimal population that’s necessary to prevent extinction and foster philosophical progress. This would maintain our optionality for pursuing natalism, anti-natalism, or something else later, while acknowledging and attempting to minimize the relevant moral risks, until we can more definitively answer the various philosophical questions that these positions depend on.
(It occurs to me this is essentially the Long Reflection, applied to the natalism question, but I don’t think I’ve seen anyone explicitly take this position or make this connection before. It seems somewhat surprising that it’s not a more popular perspective in the natalism vs anti-natalism debate.)
I wish you titled the post something like “The option value argument for preventing extinction doesn’t work”. Your current title (“The option value argument doesn’t work when it’s most needed”) has the unfortunate side effects of:
People being more likely to misinterpret or misremember your post as claiming that trying to increase option value doesn’t work in general.
Reducing extinction risk becoming the most salient example of an idea for increasing option value.
People using “the option value argument” to mean the option value argument for preventing extinction, even when this can’t be inferred from context. (See example.)
Making it harder to use the phrase “the option value argument” contextually to refer to the option value argument currently or previously discussed, when it’s not about extinction risk, since the phrase becomes a term of art for “the option value argument for preventing extinction”.
I think it may not be too late to change the title and stop or reverse these effects.