I tend to disagree with most EAs about existential risk from AI. Unfortunately, my disagreements are all over the place. It’s not that I disagree with one or two key points: there are many elements of the standard argument that I diverge from, and depending on the audience, I don’t know which points of disagreement people think are most important.
I want to write a post highlighting all the important areas where I disagree and offering my own counterarguments as an alternative. This post would benefit from responding to an existing piece, along the same lines as Quintin Pope’s article “My Objections to ‘We’re All Gonna Die with Eliezer Yudkowsky’”. By contrast, my post would be intended to address the EA community as a whole, since I’m aware many EAs already disagree with Yudkowsky even if they buy the basic arguments for AI x-risk.
My question is: what is the current best single article (or set of articles) that provides a well-reasoned and comprehensive case for believing that there is a substantial (>10%) probability of an AI catastrophe this century?
I was considering replying to Joseph Carlsmith’s article, “Is Power-Seeking AI an Existential Risk?”, since it seemed reasonably comprehensive and representative of the concerns EAs have about AI x-risk. However, I’m a bit worried that the article is not very representative of EAs who have substantial probabilities of doom, since he originally estimated a total risk of catastrophe at only 5% before 2070. In May 2022, Carlsmith changed his mind and reported a higher probability, but I am not sure whether this is because he has been exposed to new arguments, or because he simply thinks the stated arguments are stronger than he originally thought.
I suspect I have both significant moral disagreements and significant empirical disagreements with EAs, and I want to include both in such an article, while mainly focusing on the empirical points. For example, I have the feeling that I disagree with most EAs about:
How bad human disempowerment would likely be from a utilitarian perspective, and what “human disempowerment” even means in the first place
Whether there will be a treacherous turn event, during which AIs violently take over the world after previously having been behaviorally aligned with humans
How likely AIs are to coordinate near-perfectly with each other as a unified front, leaving humans out of their coalition
Whether we should expect AI values to be “alien” (like paperclip maximizers) in the absence of extraordinary efforts to align them with humans
Whether the AIs themselves will be significant moral patients, on par with humans
Whether there will be a qualitative moment when “the AGI” is created, rather than systems incrementally getting more advanced, with no clear finish line
Whether we get only “one critical try” to align AGI
Whether “AI lab leaks” are an important source of AI risk
How likely AIs are to kill every single human if they are unaligned with humans
Whether there will be a “value lock-in” event soon after we create powerful AI, causing values to stop evolving over the coming billions of years
How bad problems related to “specification gaming” will be in the future
How society is likely to respond to AI risks, and whether they’ll sleepwalk into a catastrophe
However, I also disagree with points made by many other EAs who have argued against the standard AI risk case. For example, I think that:
AIs will eventually become vastly more powerful and smarter than humans. So, I think AIs will eventually be able to “defeat all of us combined”
I think a benign “AI takeover” event is very likely even if we align AIs successfully
AIs will likely be goal-directed in the future. I don’t think, for instance, that we can just “not give the AIs goals” and then everything will be OK.
I think it’s highly plausible that AIs will end up with substantially different values from humans (although I don’t think this will necessarily cause a catastrophe).
I don’t think we have strong evidence that deceptive alignment is an easy problem to solve at the moment
I think it’s plausible that AI takeoff will be relatively fast, and the world will be dramatically transformed over a period of several months or a few years
I think short timelines, meaning a dramatic transformation of the world within the next 10 years, are pretty plausible
I’d like to elaborate on as many of these points as possible, preferably by responding to direct quotes from the representative article arguing for the alternative, more standard EA perspective.
I expect that your search for a “unified resource” will be unsatisfying. I think people disagree enough on their threat models/expectations that there is no real “EA perspective”.
Some things you could consider doing:
Have a dialogue with 1-2 key people you disagree with.
Pick one perspective (e.g., Paul’s worldview, Eliezer’s worldview) and write about where you disagree with it.
Write up a “Matthew’s worldview” doc that focuses more on explaining what you expect to happen and isn’t necessarily meant as a “counterargument” piece.
Among the questions you list, I’m most interested in these:
How bad human disempowerment would likely be from a utilitarian perspective
Whether there will be a treacherous turn event, during which AIs violently take over the world after previously having been behaviorally aligned with humans
How likely AIs are to kill every single human if they are unaligned with humans
How society is likely to respond to AI risks, and whether they’ll sleepwalk into a catastrophe
I agree there’s no single unified resource. Having said that, I found Richard Ngo’s “five alignment clusters” pretty helpful for bucketing different groups & arguments together. Reposting below:
To return to the question “what is the current best single article (or set of articles) that provides a well-reasoned and comprehensive case for believing that there is a substantial (>10%) probability of an AI catastrophe this century?”, my guess is that these different groups would respond as follows:[1]
MIRI cluster: List of Lethalities, Sharp Left Turn, Superintelligence
Structural Risk cluster: Natural selection favours AIs, RAAP
Constellation cluster: Is Power-seeking AI an x-risk, some Cold Takes posts, Scheming AIs
Prosaic cluster: Concrete problems in AI safety, [perhaps something more recent?]
Mainstream cluster: Reform AI Alignment, [not sure—perhaps nothing arguing for >10%?]
But I could easily be misrepresenting these different groups’ “core” arguments, and I haven’t read all of these, so I could be misunderstanding them.
I agree that there is no real “EA perspective”, but it seems like there could be a unified doc that a large cluster of people end up roughly endorsing. E.g., I think that if Joe Carlsmith wrote another version of “Is Power-Seeking AI an Existential Risk?” in the next several years, then it’s plausible that a relevant cluster of people would end up thinking this basically lays out the key arguments and makes the right arguments. (I’m unsure what I currently think about the old version of the doc, but I’m guessing I’ll think it misses some key arguments that now seem more obvious.)
I think the closest thing to an EA perspective written relatively recently that is all in a single doc is probably this PDF of Holden’s Most Important Century sequence on Cold Takes.
Unfortunately, I don’t think there is any such article which seems basically up-to-date and reasonable to me.
Here are reasonably up-to-date posts which seem pretty representative to me, but aren’t comprehensive. Hopefully this is still somewhat helpful:
Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover
Scheming AIs: Will AIs fake alignment during training in order to get power?
What a compute-centric framework says about AI takeoff speeds
On specifically “will the AI literally kill everyone”, I think the most up-to-date discussion is here, here, and here.
I think an updated comprehensive case is an open project that might happen in the next few years.
Thanks. It’s unfortunate there isn’t any single article that presents the case comprehensively. I’m OK with replying to multiple articles as an alternative.
In regards to the pieces you mentioned:
My understanding is that (as per the title) this piece argued that a catastrophe is likely without specific countermeasures, but it seems extremely likely that specific countermeasures will be taken to prevent a catastrophe, at least indirectly. Do you know of any other pieces that argue something more along the lines of “actually, there is a decent chance of an AI catastrophe even given normal counter-efforts”?
While I haven’t digested this post yet, my very shallow impression is that it focuses mainly on whether AIs will lie to get power sometimes, rather than whether this behavior will happen frequently and severely enough to lead to a catastrophe. I think it is very likely that AIs will sometimes lie to get power, just as humans do, but it seems like there’s a lot more that you’d need to argue to show that this might be catastrophic. Am I wrong in my impression?
I’d like to note that I don’t think I have any critical disagreements with this piece, and overall it doesn’t seem to be directly about AI x-risk per se.
This suggests that you hold a view where one of the cruxes with mainstream EA views is “EAs believe there won’t be countermeasures, but countermeasures are very likely, and they significantly mitigate the risk from AI beyond what EAs believe.” (If that is not one of your cruxes, then you can ignore the rest of this!)
The confusing thing about that is: what if EA activities are a key reason why good countermeasures end up being taken against AI? In that case, EA arguments would be a “victim” of their own success (though no one would be complaining!). But that doesn’t seem like a reason to disagree right now, when there is the common ground of “specific countermeasures really need to be taken”.
I find that quite unlikely. I think EA activities contribute on the margin, but it seems very likely to me that people would eventually have taken measures against AI risk in the absence of any EA movement.
In general, while I agree we should not take this argument so far that EA ideas become “victims of their own success”, I also think neglectedness is a standard barometer EAs have used to judge the merits of their interventions. And I think AI risk mitigation will very likely not be a neglected field in the future. This should substantially downweight our evaluation of AI risk mitigation efforts.
As a trivial example, you’d surely concede that EAs should not try to, e.g., work on making sure that future spacecraft designs are safe? Advanced spacecraft could indeed play a very important role in the future, but it seems unlikely that society would neglect to work on spacecraft safety, making this a pretty unimportant problem to work on right now. To be clear, I definitely don’t think the case for working on AI risk mitigation is as bad as the case for working on spacecraft safety, but my point is that the same idea applies in both cases.
The descriptions you gave all seem reasonable to me. Some responses:
I’m afraid not. However, I do think it’s at least plausible that only relatively minimal countermeasures will be taken.
This seems like a slight understatement to me. I think it argues that it is plausible that AIs will systematically take actions to acquire power later. That behavior would then be severe enough to cause a catastrophe if AIs were capable enough to overcome other safeguards in practice.
One argument for risk is as follows:
It’s reasonably likely that powerful AIs will be schemers and this scheming won’t be removable with current technology without “catching” the schemer in the act (as argued for by Carlsmith 2023 which I linked)
Prior to technology advancing enough to remove scheming, these scheming AIs will be able to take over, and they will do so successfully.
Neither step in the argument is trivial. For (2), the key questions are:
How much will safety technology advance due to the efforts of human researchers prior to powerful AI?
When scheming AIs first become transformatively useful for safety work, will we be able to employ countermeasures which allow us to extract lots of useful work from these AIs while still preventing them from being able to take over without getting caught? (See our recent work on AI Control for instance.)
What happens when scheming AIs are caught? Is this sufficient?
How long will we have with the first transformatively useful AIs prior to much more powerful AI being developed? So, how much work will we be able to extract out of these AIs?
Can we actually get AIs to productively work on AI safety? How will we check their work, given that they might be trying to screw us over?
Many of the above questions depend on the strength of the societal response.
This is just focused on the scheming threat model, which is not the only threat model.
We (Redwood Research, where I work) might put out some posts soon which indirectly argue that (2) won’t necessarily go well by default. (This will also argue for tractability.)
Agreed, but relatively little time is an important part of the overall threat model so it seems relevant to reference when making the full argument.
It doesn’t go much into probabilities or extinction, and may therefore not be what you are looking for, but I’ve found Dan Hendrycks’ overview/introduction to AI risks to be a pretty comprehensive collection. (https://arxiv.org/abs/2306.12001)
(I, for one, would love to see someone critique this, although the FAQ at the end is already a good start on some counterarguments and possible responses to those.)
I second this.
Also worth mentioning https://arxiv.org/abs/2306.06924 - a paper by Critch and Russell in a very similar genre.
I was reading one today that I think is in a similar vein to those you mention:
https://library.oapen.org/bitstream/handle/20.500.12657/75844/9781800647886.pdf?sequence=1#page=226
The Main Sources of AI Risk?
Not exactly what you’re asking for, but you could use it as a reference for all of the significant risks that different people have brought up, to select which ones you want to further research and address in your response post.
This is definitely not a “canonical” answer, but one of the things I find myself most frequently linking back to is Eliezer’s List of Lethalities. I do think it is pretty comprehensive in what it covers, though isn’t structured as a strict argument.
An artificially structured argument for expecting AGI ruin.
Opinions on this are pretty diverse. I largely agree with the bulleted list of things-you-think, and this article paints a picture of my current thinking.
My threat model is something like: the very first AGIs will probably be near human-level and won’t be too hard to limit/control. But in human society, tyrants are overrepresented among world leaders relative to their frequency in the population of people smart enough to lead a country. We’ll probably end up inventing multiple versions of AGI, some of which may be straightforwardly turned into superintelligences and others not. The worst superintelligence we help to invent may win, and if it doesn’t, it’ll probably be because a different one beats it (or reaches an unbeatable position first). Humans will probably be sidelined if we survive a battle between super-AGIs. So it would be much safer not to invent them, but it’s also hard to avoid inventing them! I have low confidence in my P(catastrophe) and I’m unsure how to increase my confidence.
But I prefer estimating P(catastrophe) over P(doom) because extinction is not all that concerns me. Some stories about AGI lead to extinction, others to mass death, others to dystopia (possibly followed by mass death later), others to utopia followed by catastrophe, and still others to a stable and wonderful utopia (with humanity probably sidelined eventually, which may even be a good thing). I think I could construct a story along any of these lines.
Well said. I also think it’s important to define what is meant by “catastrophe.” Just as an example, I personally would consider it catastrophic to see a future in which humanity is sidelined and subjugated by an AGI (even a “friendly,” aligned one), but many here would likely disagree with me that this would be a catastrophe. I’ve even heard otherwise rational (non-EA) people describe a future in which humans are ‘pampered pets’ of an aligned ASI as ‘utopian,’ which just goes to show the level of disagreement.
To me, it’s important whether the AGIs are benevolent and have qualia/consciousness. If AGIs are ordinary computers but smart, I may agree; if they are conscious and benevolent, I’m okay being a pet.
I’m not sure whether we could ever truly know if an AGI was conscious or experienced qualia (which are by definition not quantifiable). And you’re probably right that being a pet of a benevolent ASI wouldn’t be a miserable thing (but it is still an x-risk … because it permanently ends humanity’s status as a dominant species).
I would caution against thinking the Hard Problem of Consciousness is unsolvable “by definition” (if it is solved, qualia will likely become quantifiable). I think the reasonable thing is to presume it is solvable. But until it is solved we must not allow AGI takeover, and even if AGIs stay under human control, it could lead to a previously unimaginable power imbalance between a few humans and the rest of us.
A new paper on this came out recently: https://link.springer.com/article/10.1007/s00146-024-01930-2
I gave a brief reply to the paper here.
I think you’ve summarized the general state of EA views on x-risk from artificial intelligence, thanks! My views* are considered extreme around here, but I think it’s important to note that there seems to be a vocal contingent of us who give lower consideration to AI x-risk, at least on the forum, and I wonder if this represents a broader trend. (Epistemic status: low; I have no hard data to back this up besides the fact that there seem to be more pro-AI posts around here.)
*I think any substantial (>0.01%) risk of extinction due to AI action in the next century warrants a total and “permanent” (>50 years) pause on all AI development, enforced through international law