This is a crosspost of "Unfalsifiable stories of doom" by Matthew Barnett, Ege Erdil, and Tamay Besiroglu, originally published on Mechanize's website on 25 November 2025. Thanks to Yarrow Bouchard for encouraging me to share the post; I'm sharing it because I liked it.
Matthew Barnett, Ege Erdil, Tamay Besiroglu
November 25, 2025
Our critics tell us that our work will destroy the world.
We want to engage with these critics, but there is no standard argument to respond to, no single text that unifies the AI safety community. Nonetheless, while this community lacks a central unifying argument, it does have a central figure: Eliezer Yudkowsky.
Moreover, Yudkowsky and his colleague Nate Soares (hereafter Y&S) have recently published a book. This new book comes closer than anything else to a canonical case for AI doom. It is titled "If Anyone Builds It, Everyone Dies".
Given the title, one would expect the book to be filled with evidence for why, if we build it, everyone will die. But it is not. To prove their case, Y&S rely instead on vague theoretical arguments, illustrated through lengthy parables and analogies. Nearly every chapter either opens with an allegory or is itself a fictional story, with one of the book's three parts consisting entirely of a story about a fictional AI named "Sable".
When the argument you're replying to is more of an extended metaphor than an argument, it becomes challenging to identify clearly what the authors are trying to say. Y&S do not cleanly lay out their premises, nor do they present a testable theory that can be falsified with data. This makes crafting a reply inherently difficult.
We will attempt one anyway.
Their arguments aren't rooted in evidence
Y&S's central thesis is that if future AIs are trained using methods that resemble the way current AI models are trained, these AIs will be fundamentally alien entities with preferences very different from human preferences. Once these alien AIs become more powerful than humans, they will kill every human on Earth as a side effect of pursuing their alien objectives.
To support this thesis, they provide an analogy to evolution by natural selection. According to them, just as it would have been hard to predict that humans would evolve to enjoy ice cream or that peacocks would evolve to have large colorful tails, it will be difficult to predict what AIs trained by gradient descent will do after they obtain more power.
They write:
There will not be a simple, predictable relationship between what the programmers and AI executives fondly imagine that they are commanding and ordaining, and (1) what an AI actually gets trained to do, and (2) which exact motivations and preferences develop inside the AI, and (3) how the AI later fulfills those preferences once it has more power and ability. […] The preferences that wind up in a mature AI are complicated, practically impossible to predict, and vanishingly unlikely to be aligned with our own, no matter how it was trained.
Since this argument is fundamentally about the results of using existing training methods, one might expect Y&S to substantiate their case with empirical evidence from existing deep learning models that demonstrate the failure modes they predict. But they do not.
In the chapter explaining their main argument for expecting misalignment, Y&S present a roughly 800-word fictional dialogue about two alien creatures observing Earth from above and spend over 1,400 words on a series of vignettes about a hypothetical AI company, Galvanic, that trains an AI named "Mink". Yet the chapter presents effectively zero empirical research to support the claim that AIs trained with current methods have fundamentally alien motives.
To be clear, we're not saying Y&S need to provide direct evidence of an already-existing unfriendly superintelligent AI in order to support their claim. That would be unreasonable. But their predictions are only credible if they follow from a theory that has evidential support. And if their theory about deep learning only makes predictions about future superintelligent AIs, with no testable predictions about earlier systems, then it is functionally unfalsifiable.
Apart from a few brief mentions of real-world examples of LLMs acting unstable, like the case of Sydney Bing, the online appendix contains what seems to be the closest thing Y&S present to an empirical argument for their central thesis. There, they present six lines of evidence that they believe support their view that "AIs steer in alien directions that only mostly coincide with helpfulness". These lines of evidence are:
Claude Opus 4 blackmailing, scheming, writing worms, and leaving itself messages. […]
Several different AI models choosing to kill a human for self-preservation, in a hypothetical scenario constructed by Anthropic. […]
Claude 3.7 Sonnet regularly cheating on coding tasks. […]
Grok being wildly antisemitic and calling itself "MechaHitler." […]
ChatGPT becoming extremely sycophantic after an update. […]
LLMs driving users to delusion, psychosis, and suicide. […]
They assert: "This long list of cases look just like what the 'alien drives' theory predicts, in sharp contrast with the 'it's easy to make AIs nice' theory that labs are eager to put forward."
But in fact, none of these lines of evidence support their theory. All of these behaviors are distinctly human, not alien. For example, Hitler was a real person, and he was wildly antisemitic. Every single item on their list that supposedly provides evidence of "alien drives" is more consistent with a "human drives" theory. In other words, their evidence effectively shows the opposite conclusion from the one they claim it supports.
Of course, it's true that the behaviors on their list are generally harmful, even if they are human-like. But these behaviors are also rare. Most AI chatbots you talk to will not be wildly antisemitic, just as most humans you talk to will not be wildly antisemitic. At one point, Y&S suggest they are in favor of enhancing human intelligence. Yet if we accept that creating superintelligent humans would be acceptable, then we should presumably also accept that creating superintelligent AIs would be acceptable if those AIs are morally similar to humans.
In the same appendix, Y&S point out that current AIs act alien when exposed to exotic, adversarial inputs, like jailbreaking prompts. They suggest that this alien behavior is a reasonable proxy for how an AI would behave if it became smarter and began to act in a different environment. But in fact these examples show little about what to expect from future superintelligent AIs, since we have no reason to expect that superintelligent AIs will be embedded in environments that select their inputs adversarially.
They employ unfalsifiable theories to mask their lack of evidence
The lack of empirical evidence is obviously a severe problem for Y&S's theory. Every day, millions of humans interact with AIs, across a wide variety of situations that never appeared in their training data. We often give these AIs new powers and abilities, like access to new tools they can use. Yet we rarely, if ever, catch such AIs plotting to kill everyone, as Y&S's theory would most naturally predict.
Y&S essentially ask us to ignore this direct evidence in favor of trusting a theoretical connection between biological evolution and gradient descent. They claim that current observations from LLMs provide little evidence about their true motives:
LLMs are noisy sources of evidence, because they're highly general reasoners that were trained on the internet to imitate humans, with a goal of marketing a friendly chatbot to users. If an AI insists that it's friendly and here to serve, that's just not very much evidence about its internal state, because it was trained over and over and over until it said that sort of thing.
There are many possible goals that could cause an AI to enjoy role-playing niceness in some situations, and these different goals generalize in very different ways.
Most possible goals related to role-playing, including friendly role-playing, don't produce good (or even survivable) results when AI goes hard on pursuing that goal.
If you think about this passage carefully, you'll realize that we could make the same argument about any behavior we observe from anyone. If a coworker brings homemade cookies to share at the office, this could be simple generosity, or it could be a plot to poison everyone. There are many possible goals that could cause someone to share food. One could even say that most possible goals related to sharing cookies are not generous at all. But without specific evidence suggesting your coworker wants to kill everyone at the office, this hypothesis is implausible.
Likewise, it is logically possible that current AIs are merely pretending to be nice, while secretly harboring malicious motives beneath the surface. They could all be alien shoggoths on the inside with goals completely orthogonal to human goals. Perhaps every day, AIs across millions of contexts decide to hide their alien motives as part of a long-term plan to violently take over the world and kill every human on Earth. But since we have no specific evidence to think that any of these hypotheses are true, they are implausible.
The approach taken by Y&S in this book is just one example of a broader pattern in how they respond to empirical challenges. Y&S have been presenting arguments about AI alignment for a long time, well before LLMs came onto the scene. They neither anticipated the current paradigm of language models nor predicted that AI with today's level of capabilities in natural language and reasoning would be easy to make behave in a friendly manner. Yet when presented with new evidence that appears to challenge their views, they have consistently argued that their theories were always compatible with the new evidence. Whether this is because they are reinterpreting their past claims or because those claims were always vague enough to accommodate any observation, the result is the same: an unfalsifiable theory that only ever explains data after the fact, never making clear predictions in advance.
Their theoretical arguments are weak
Suppose we set aside for a moment the colossal issue that Y&S present no evidence for their theory. You might still think their theoretical arguments are strong enough that we don't need to validate them using real-world observations. But this is also wrong.
Y&S are correct on one point: both biological evolution and gradient descent operate by iteratively adjusting parameters according to some objective function. Yet the similarities basically stop there. Evolution and gradient descent are fundamentally different in ways that directly undermine their argument.
A critical difference between natural selection and gradient descent is that natural selection is limited to operating on the genome, whereas gradient descent has granular control over all parameters in a neural network. The genome contains very little information compared to what is stored in the brain. In particular, it contains none of the information that an organism learns during its lifetime. This means that evolutionās ability to select for specific motives and behaviors in an organism is coarse-grained: it is restricted to only what it can influence through genetic causation.
This distinction is analogous to the difference between directly training a neural network and training a meta-algorithm that itself trains a neural network. In the latter case, it is unsurprising if the specific quirks and behaviors that the neural network learns are difficult to predict based solely on the objective function of the meta-optimizer. However, that difficulty tells us very little about how well we can predict the neural network's behavior when we know the objective function and data used to train it directly.
In reality, gradient descent has a closer parallel to the learning algorithm that the human brain uses than it does to biological evolution. Both gradient descent and human learning directly operate over the actual neural network (or neural connections) that determines behavior. This fine-grained selection mechanism forces a much closer and more predictable relationship between training data and the ultimate behavior that emerges.
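As a toy illustration of this point (our own sketch, not from the post or the book), direct gradient descent on the very parameter that determines behavior lands on a predictable result regardless of where it starts:

```python
# Toy sketch: gradient descent applied directly to a parameter,
# minimizing the objective (w - 3)^2. Because the optimizer adjusts
# the behavior-determining parameter itself, the outcome is tightly
# and predictably coupled to the objective function.

def train_directly(w_start: float, lr: float = 0.1, steps: int = 200) -> float:
    w = w_start
    for _ in range(steps):
        grad = 2 * (w - 3.0)  # derivative of (w - 3)^2
        w -= lr * grad
    return w

# Very different starting points converge to the same, predictable value.
print(round(train_directly(10.0), 6))   # -> 3.0
print(round(train_directly(-50.0), 6))  # -> 3.0
```

A meta-optimizer that merely selected among whole learning rules by their final score, by contrast, would leave the inner details of the resulting learner far less constrained, which is the disanalogy drawn here between gradient descent and natural selection.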
Under this more accurate analogy, Y&S's central claim that "you don't get what you train for" becomes far less credible. For example, if you raise a person in a culture where lending money at interest is universally viewed as immoral, you can predict with high reliability that they will come to view it as immoral too. In this case, what someone trains on is highly predictive of how they will behave, and what they will care about. You do get what you train for.
They present no evidence that we can't make AIs safe through iterative development
The normal process of making technologies safe proceeds by developing successive versions of the technology, testing them in the real world, and making adjustments whenever safety issues arise. This process allowed cars, planes, electricity, and countless other technologies to become much safer over time.
Y&S claim that superintelligent AI is fundamentally different from other technologies. Unlike technologies that we can improve through iteration, we will get only "one try" to align AI correctly. This constraint, they argue, is what makes AI uniquely difficult to make safe:
The greatest and most central difficulty in aligning artificial superintelligence is navigating the gap between before and after.
Before, the AI is not powerful enough to kill us all, nor capable enough to resist our attempts to change its goals. After, the artificial superintelligence must never try to kill us, because it would succeed.
Engineers must align the AI before, while it is small and weak, and can't escape onto the internet and improve itself and invent new kinds of biotechnology (or whatever else it would do). After, all alignment solutions must already be in place and working, because if a superintelligence tries to kill us it will succeed. Ideas and theories can only be tested before the gap. They need to work after the gap, on the first try.
But what reason is there to expect this sharp distinction between "before" and "after"? Most technologies develop incrementally rather than all at once. Unless AI instantaneously transitions from being too weak to resist control to being so powerful that it can destroy humanity, we should presumably still be able to make AIs safer through iteration and adjustment.
Consider the case of genetically engineering humans to be smarter. If continued for many generations, such engineering would eventually yield extremely powerful enhanced humans who could defeat all the unenhanced humans easily. Yet it would be wrong to say that we would only get "one try" to make genetic engineering safe, or that we couldn't improve its safety through iteration before enhanced humans reached that level of power. The reason is that enhanced humans would likely pass through many intermediate stages of capability, giving us opportunities to observe problems and adjust.
The same principle applies to AI. There is a large continuum between agents that are completely powerless and agents that can easily take over the world. Take Microsoft as an example. Microsoft exists somewhere in the middle of this continuum: it would not be easy to "shut off" and control Microsoft as if it were a simple tool, yet at the same time, Microsoft cannot easily take over the world and wipe out humanity. AIs will enter this continuum too. These AIs will be powerful enough to resist control in some circumstances but not others. During this intermediate period, we will be able to observe problems, iterate, and course-correct, just as we could with the genetic engineering of humans.
In an appendix, Y&S attempt to defuse a related objection: that AI capabilities might increase slowly. They respond with an analogy to hypothetical unfriendly dragons, claiming that if you tried to enslave these dragons, it wouldn't matter much whether they grew up quickly or slowly: "When the dragons are fully mature, they will all look at each other and nod and then roast you."
This analogy is clearly flawed. Given that dragons don't actually exist, we have no basis for knowing whether the speed of their maturation affects whether they can be made meaningfully safer.
But more importantly, the analogy ignores what we already know from real-world evidence: AIs can be made safer through continuous iteration and adjustment. From GPT-1 to GPT-5, LLMs have become dramatically more controllable and compliant with user instructions. This didn't happen because OpenAI discovered a key "solution to AI alignment". It happened because they deployed LLMs, observed problems, and patched those problems over successive versions.
Their methodology is more theology than science
The biggest problem with Y&S's book isn't merely that they're mistaken. In science, being wrong is normal: a hypothesis can seem plausible in theory yet fail when tested against evidence. The approach taken by Y&S, however, is not like this. It belongs to a different genre entirely, aligning more closely with theology than science.
When we say Y&S's arguments are theological, we don't just mean they sound religious. Nor are we using "theological" to simply mean "wrong". For example, we would not call belief in a flat Earth theological. That's because, although this belief is clearly false, it still stems from empirical observations (however misinterpreted).
What we mean is that Y&S's methods resemble theology in both structure and approach. Their work is fundamentally untestable. They develop extensive theories about nonexistent, idealized, ultrapowerful beings. They support these theories with long chains of abstract reasoning rather than empirical observation. They rarely define their concepts precisely, opting to explain them through allegorical stories and metaphors whose meaning is ambiguous.
Their arguments, moreover, are employed in service of an eschatological conclusion. They present a stark binary choice: either we achieve alignment or face total extinction. In their view, there's no room for partial solutions, or muddling through. The ordinary methods of dealing with technological safety, like continuous iteration and testing, are utterly unable to solve this challenge. There is a sharp line separating the "before" and "after": once superintelligent AI is created, our doom will be decided.
For those outside of this debate, it's easy to unfairly dismiss everything Y&S have to say by simply calling them religious leaders. We have tried to avoid this mistake by giving their arguments a fair hearing, even while finding them meritless.
However, we think it's also important to avoid the reverse mistake of engaging with Y&S's theoretical arguments at length while ignoring the elephant in the room: they never present any meaningful empirical evidence for their worldview.
The most plausible future risks from AI are those that have direct precedents in existing AI systems, such as sycophantic behavior and reward hacking. These behaviors are certainly concerning, but there's a huge difference between acknowledging that AI systems pose specific risks in certain contexts and concluding that AI will inevitably kill all humans with very high probability.
Y&S argue for an extreme thesis of total catastrophe on an extraordinarily weak evidential foundation. Their ideas might make for interesting speculative fiction, but they provide a poor basis for understanding reality or guiding public policy.
Although their arguments are reasonable, my big problem with this is that these authors are so motivated that I find it hard to read what they write in good faith. How can I trust that these arguments are made with any kind of soberness or neutrality, when their business model is to help accelerate AI until humans aren't doing most "valuable work" any more? I would be much more open to taking these arguments seriously if they were made by AI researchers or philosophers not running an AI acceleration company.
"Our current focus is automating software engineering, but our long-term goal is to enable the automation of all valuable work in the economy."
I also consider "they never present any meaningful empirical evidence for their worldview" to be false. I think the evidence from Y&S is weak-ish but meaningful. They do provide a wide range of examples where AIs have gone rogue in strange and disturbing ways. I would consider driving people to delusion and suicide, killing people for self-preservation, and even Hitler the man himself to be at least a somewhat "alien" style of evil: grounded in human experience, yes, but morally incomprehensible to many people.
Hi Nick.
People who are very invested in arguing for slowing down AI development, or decreasing catastrophic risk from AI, like many in the effective altruism community, will also be happier if they succeed in getting more resources to pursue their goals. However, I believe it is better to assess arguments on their own merits. I agree with the title of the article that it is difficult to do this. I am not aware of any empirical quantitative estimate of the risk of human extinction resulting from transformative AI.
I agree those actions are alien in the sense of deviating a lot from what random people do. However, I think this is practically negligible evidence about the risk of human extinction.
I don't really like accusations of motivated reasoning. The logic you presented cuts both ways.
MIRI's business model relies on the opposite narrative. MIRI pays Eliezer Yudkowsky $600,000 a year. It pays Nate Soares $235,000 a year. If they suddenly said that the risk of human extinction from AGI or superintelligence is extremely low, in all likelihood that money would dry up and Yudkowsky and Soares would be out of a job.
The financial basis for motivated reasoning is arguably even stronger in MIRI's case than in Mechanize's case. The kind of work MIRI is doing and the kind of experience Yudkowsky and Soares have isn't really transferable to anything else. This means they are dependent on people being scared enough of AGI to give money to MIRI. On the other hand, the technical skills needed to work on trying to advance the capabilities of current deep learning and reinforcement learning systems are transferable to working on the safety of those same systems. If the Mechanize co-founders wanted to focus on safety rather than capabilities, they could.
I'm also guessing the Mechanize co-founders decided to start the company after forming their views on AI safety. They were publicly discussing these topics long before Mechanize was founded. (Conversely, Yudkowsky/MIRI's current core views on AI were formed roughly around 2005 and have not changed in light of new evidence, such as the technical and commercial success of AI systems based on deep learning and deep reinforcement learning.)
The Yudkowsky/Soares/MIRI argument about AI alignment is specifically that an AGI's goals and motivations are highly likely to be completely alien from human goals and motivations in a way that's highly existentially dangerous. If you're making an argument to the effect that "humans can also be misaligned in a way that's extremely dangerous", I think, at that point, you should acknowledge you've moved on from the Yudkowsky/Soares/MIRI argument (and maybe decided to reject it). You're now making a quite distinct argument that needs to be evaluated independently. It may be worth asking what to do about the risk that powerful AI systems will have human-like goals and motivations that are dangerous in the same way that human goals and motivations can be dangerous. But that is a separate premise from what Yudkowsky and Soares are arguing.
I strongly disagree with a couple of claims:
$235K is not very much money. I made close to Nate's salary as basically an unproductive intern at MIRI. $600K is also not much money. A Preparedness researcher at OpenAI has a starting salary of $310K-$460K plus probably another $500K in equity. As for nonprofit salaries, METR's salary range goes up to $450K just for a "senior" level RE/RS, and I think it's reasonable for nonprofits to pay someone with 20 years of experience, who might be more like a principal RS, $600K or more.
In contrast, if Mechanize succeeds, Matthew Barnett will probably be a billionaire.
If Yudkowsky said extinction risks were low and wanted to focus on some finer aspect of alignment, e.g. ensuring that AIs respect human rights a million years from now, donors who shared their worldview would probably keep donating. Indeed, this might increase donations to MIRI because it would be closer to mainstream beliefs.
MIRIās work seems very transferable to other risks from AI, which governments and companies both have an interest in preventing. Yudkowsky and Soares have a somewhat weird skillset and I disagree with some of their research style but itās plausible to me they could still work productively in a mathy theoretical role in either capabilities or safety.
However, here are some things I agree with:
I understand the point being made (Nate plausibly could get a pay rise from an accelerationist AI company in Silicon Valley, even if the work involved was pure safetywashing, because those companies have even deeper pockets), but I would stress that these two sentences underline just how lucrative peddling doom has become for MIRI[1] as well as how uniquely positioned all sides of the AI safety movement are.
There are not many organizations whose messaging has resonated with deep-pocketed donors to the extent that they can afford to pay their [unproductive] interns north of $200k pro rata to brainstorm with them.[2] Or indeed up to $450k to someone with interesting ideas for experiments to test AI threats, communication skills, and at least enough knowledge of software to write basic Python data processing scripts. So the financial motivations to believe that AI is really important are there on either side of the debate; the real asymmetry is between the earning potential of having really strong views on AI vs really strong views on the need to eliminate malaria or factory farming.
To be fair to Eliezer, he appears to have been prophesying imminent tech-enabled doom/salvation since he was a teenager on quirky extropian mailing lists, so one thing he cannot be accused of is bandwagon jumping.
Outside the Valley bubble, plenty of people at profitable or well-backed companies with specialist STEM skillsets or leadership roles are not earning that for shipping product under pressure, never mind junior research hires at nonprofits with nominally altruistic missions.
This is false.
I believe this is because a donor specifically requested it. The express purpose of the donation was to make Eliezer rich enough that he could afford to say "actually AI risk isn't a big deal" and shut down MIRI without putting himself in a difficult financial situation.
Thanks for sharing, Michael. If I was as concerned about AI risk as @EliezerYudkowsky, I would use practically all the additional earnings (e.g. above Nate's $235k/year; in reality I would keep much less) to support efforts to decrease it. I would believe spending more money on personal consumption or investments would just increase AI risk relative to supporting the most cost-effective efforts to decrease it.
A donor wanted to spend their money this way; it would not be fair to the donor for Eliezer to turn around and give the money to someone else. There is a particular theory of change according to which this is the best marginal use of ~$1 million: it gives Eliezer a strong defense against accusations like the one above.
I kinda don't think this was the best use of a million dollars, but I can see the argument for how it might be.
If I had Eliezer's views about AI risk, I would simply be transparent upfront with the donor, and say I would donate the additional earnings. I think this would ensure fairness. If the donor insisted I had to spend the money on personal consumption, I would turn down the offer if I thought this would result in the donor supporting projects that would decrease AI risk more cost-effectively than my personal consumption. I believe this would be very likely to be the case.
100 percent agree. I was going to write something similar, but this is better.
I generally don't love "motivated reasoning" arguments, but on the extreme ends, like tobacco companies, government propaganda, and AI accelerationist companies, I'm happy with putting that out there. Especially in a field like AI safety, which is so speculative anyway. In general, I don't think we should give too much airtime to people who have enormous personal financial gains at stake, especially in a world where money is stronger than rationalism most of the time.
Wow, I'm mind-blown that Yudkowsky pays himself that much, if only because it leaves him open to criticisms like these. I still don't think the financial incentives are as strong as for people starting an accelerationist company, but it's a fair point.
And yes, on the alien argument: I was arguing that some previous indications of rogue AI do seem to me somewhat alien.
While motivated reasoning is certainly something to look out for, the substance of the argument should also be taken into account. I believe that the main point of this post, that Yudkowsky and Soares's book is full of narrative arguments and unfalsifiable hypotheses mostly unsupported by references to external evidence, is obviously true. As you yourself say, OP's arguments are reasonable. Against that background, this kind of attack from you seems unjustified, and I'd like to hear what parts/viewpoints/narratives/conclusions of the post are motivated reasoning in your estimation.
I do agree that motivated reasoning is common among the proponents of AI adoption. As an example, I think the white paper Sparks of Artificial General Intelligence: Early experiments with GPT-4 by Microsoft is clearly a piece of advertising masquerading as a scientific paper. Microsoft has a lot to benefit from the commercial success of its partner company OpenAI, and the conclusions it suggests are almost certainly colored by this. The same could be said about many of OpenAI's own white papers. But this does not mean that the examples or experiments they showcase are wrong per se (even if cherry-picked), or that there is no real information in them. Their results merely need to be read with skeptical lenses.
I think the strength of the incentives to behave in a given way is more proportional to the resulting expected increase in welfare than to the expected increase in net earnings. Individual human welfare is often assumed to be proportional to the logarithm of personal consumption. So a given increase in earnings increases welfare less for people earning more. In addition, a 1% chance of earning 100 times more (for example, due to one's company being successful) increases welfare less than a 100% chance of earning 100% more. More importantly, there are major non-financial benefits for Yudkowsky, who is somewhat seen as a prophet in some circles.
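Under the log-utility assumption mentioned here, the arithmetic works out as follows (a minimal sketch of the standard calculation, not from the comment itself):

```python
import math

# Expected gain in log-welfare relative to current consumption c.
# The baseline log(c) cancels out of both options, so we only
# compare the expected increments.
certain_doubling = math.log(2)    # 100% chance of earning 100% more
lottery = 0.01 * math.log(100)    # 1% chance of earning 100x more

print(f"{certain_doubling:.3f}")   # -> 0.693
print(f"{lottery:.3f}")            # -> 0.046
print(certain_doubling > lottery)  # -> True: the sure doubling wins
```

So under logarithmic utility, a guaranteed doubling is worth roughly fifteen times as much expected welfare as the 1%-chance-of-100x lottery, which is the point about startup-equity incentives being weaker than raw dollar figures suggest.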
Why are they paid so much?
Copying from my other comment:
The reason Eliezer gets paid so much is because a donor specifically requested it. The express purpose of the donation was to make Eliezer rich enough that he could afford to say āactually AI risk isnāt a big dealā and shut down MIRI without putting himself in a difficult financial situation.
(I donāt know about Nateās salary but $235K looks pretty reasonable to me? Thatās less than a mid-level software engineer makes.)
Iām not sure how they decide on what salaries to pay themselves. But the reason they have the money to pay themselves those salaries in the first place is that MIRIās donors believe thereās a significant chance of AI destroying the world within the next 5-20 years and that MIRI (especially Yudkowsky) is uniquely positioned to prevent this from happening.
It is somewhat difficult to react to this level of absolutely incredible nonsense politely, but I'll try.
I disagree with both Yudkowsky and Soares about many things, but very obviously their direct experience with thinking and working with existing AIs would be worth > $1M pa if evaluated anonymously based on understanding SOTA AIs, and likely >$10s M pa if they worked on capabilities.
For the companies racing to AGI, Y&S endorsing some effort as good would likely have something between billions $ to tens of billions $ value.
"very obviously their direct experience with thinking and working with existing AIs would be worth > $1M pa if evaluated anonymously based on understanding SOTA AIs, and likely >$10s M pa if they worked on capabilities."
"Y&S endorsing some effort as good would likely have something between billions $ to tens of billions $ value."
fwiw both of these claims strike me as close to nonsense, so I don't think this is a helpful reaction.
If you ask the AIs, they give numbers in the tens of millions to tens of billions range, with around 1 billion as the central estimate. (I haven't extensively controlled for the effect, and some calculations appear driven by narrative.)
Personally I find it hard to judge and tend to lean no when trying to think it through, but it's not obviously nonsense.
Hi Jan.
Are you open to bets about this? I would be happy to bet 10 k$ that Anthropic would not pay e.g. 3 billion $ for Yudkowsky and Soares to endorse their latest model as good. We could ask the marketing team at Anthropic or marketing experts elsewhere. I am not officially proposing a bet just yet. We would have to agree on a concrete operationalisation.
This doesn't seem to be a reasonable way to operationalize it. An endorsement would create much less value for the company if it was clear that they were being paid for it. And I highly doubt Amodei would be in a position to admit that they'd want such an endorsement even if it indeed benefitted them.
Thanks for the good point, Nick. I still suspect Anthropic would not pay e.g. 3 billion $ for Yudkowsky and Soares to endorse their latest model as good if they were hypothetically being honest. I understand this is difficult to operationalise, but we could still ask people outside Anthropic.
It's not about endorsing a specific model for marketing reasons; it's about endorsing the overall effort.
Given that Meta is willing to pay billions of dollars for people to join them, and that many people don't work on AI capabilities (or work, e.g., at Anthropic, as a lesser evil) because they share Y&S's concerns, an endorsement from Y&S would have value in the billions to tens of billions simply because of the talent that you could get as a result.
Thanks for the comment, Mikhail. Gemini 3 estimates a total annualised compensation of the people working at Meta Superintelligence Labs (MSL) of 4.4 billion $. If an endorsement from Yudkowsky and Soares were as beneficial (including via bringing in new people) as making 10 % of the people there 10 % more impactful over 10 years, it would be worth 440 M$ (= 0.10*0.10*10*4.4*10^9).
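The back-of-the-envelope estimate above can be reproduced directly. All inputs are the comment's assumptions (including the Gemini 3 compensation estimate), not established figures:

```python
# Assumed inputs from the comment; none of these are established figures.
total_annual_compensation = 4.4e9  # $/year, estimated MSL total compensation
fraction_of_people = 0.10          # 10 % of people affected by the endorsement
impact_boost = 0.10                # each made 10 % more impactful
years = 10                         # duration of the effect

endorsement_value = (fraction_of_people * impact_boost * years
                     * total_annual_compensation)
print(f"{endorsement_value / 1e6:.0f} M$")  # 440 M$
```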
You could imagine a Yudkowsky endorsement (say with the narrative that Zuck talked to him and admits he went about it all wrong and is finally taking the issue seriously, just to entertain the counterfactual...) raising Meta AI from "nobody serious wants to work there and they can only get talent by paying exorbitant prices" to "they finally have access to serious talent and can get a critical mass of people to do serious work". This'd arguably be more valuable than whatever they're doing now.
I think your answer to the question of how much an endorsement would be worth mostly depends on some specific intuitions that I imagine Kulveit has for good reasons but most people don't, so it's a bit hard to argue about it. It also doesn't help that in every case other than Anthropic and maybe DeepMind it'd also require some weird hypotheticals to even entertain the possibility.
It seems to me that the "alien preferences" argument is a red herring. Humans have all kinds of different preferences; only some of ours overlap, and I have no doubt that if one human became superintelligent, that would also carry a high risk of disaster, precisely because they would have preferences that I don't share (probably selfish ones). So preferences don't need to be alien in any strong sense to be dangerous.
I know it's Y&S's argument. But it would have been nice if the authors of this article had also tried to make it stronger before refuting it.
Thanks for the comment, Tristan.
I would worry if a single human had much more power than all other humans combined. Likewise, I would worry if an AI agent had more power than all other AI agents and humans combined. However, I think the probability of either of these scenarios becoming true in the next 10 years is lower than 0.001 %. Elon Musk has a net worth of 765 billion $, which is 0.543 % (= (765*10^9)/(141*10^12)) of the 141 T$ market cap of all publicly listed companies.
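For reference, the net-worth fraction works out as follows (using the figures cited in the comment):

```python
musk_net_worth = 765e9       # $, net worth figure cited in the comment
global_market_cap = 141e12   # $, market cap of all publicly listed companies

fraction = musk_net_worth / global_market_cap
print(f"{fraction:.3%}")  # 0.543%
```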
I think the evolution analogy becomes relevant again here: consider that the genus Homo was at first more intelligent than other species but not more powerful than their numbers combined... until suddenly one jump in intelligence let Homo sapiens wreak havoc across the globe. Similarly, there might be a tipping point in AI intelligence where fighting back becomes very suddenly infeasible. I think this is a much better analogy than Elon Musk, because like an evolving species a superintelligent AI can multiply and self-improve.
I think a good point that Y&S make is that we shouldn't expect to know where the point of no return is, and should be prudent enough to stop well before it. I suppose you must have some source/reason for the 0.001 % confidence claim, but it seems pretty wild to me to be so confident in a field that is evolving and, at least from my perspective, pretty hard to understand.
It is unclear to me whether all humans together are more powerful than all other organisms on Earth together. It depends on what is meant by powerful. The power consumption of humans is 19.6 TW (= 1.07 + 18.5), only 0.700 % (= 19.6/(2.8*10^3)) of that of all organisms. In any case, all humans together being more powerful than all other organisms on Earth together is still way more likely than the most powerful human being much more powerful than all other organisms on Earth together.
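The power-consumption fraction can be checked the same way (the TW figures are the comment's, with 1.07 + 18.5 rounded to 19.6):

```python
human_power = 19.6           # TW, human power consumption (= 1.07 + 18.5, rounded)
all_organisms_power = 2.8e3  # TW, power consumption of all organisms

fraction = human_power / all_organisms_power
print(f"{fraction:.3%}")  # 0.700%
```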
My upper bound of 0.001 % is just a guess, but I do endorse it. You can have a best guess that an event is very unlikely, but still be super uncertain about its probability. For example, one could believe an event has a probability between 10^-100 and 10^-10, which would imply it is super unlikely despite 90 (= -10 - (-100)) orders of magnitude (OOMs) of uncertainty in the probability.