There seem to be two main framings emerging from recent AGI x-risk discussion: default doom, given AGI, and default we’re fine, given AGI.
I’m interested in what people who have low p(doom|AGI) think are the reasons that things will basically be fine once we have AGI (or TAI, PASTA, ASI). What mechanisms are at play? How is alignment solved so that there are 0 failure modes? Can we survive despite imperfect alignment? How? Is alignment moot? Will physical limits be reached before there is too much danger?
If you have high enough p(doom|AGI) to be very concerned, but you’re still only at ~1-10%, what is happening in the other 90-99%?
Added 22Apr: I’m also interested in detailed scenarios and stories, spelling out how things go right post-AGI. There are plenty of stories and scenarios illustrating doom. Where are the similar stories illustrating how things go right? There is the FLI World Building Contest, but that took place in the pre-GPT-4+AutoGPT era. The winning entry has everyone acting far too sensibly in terms of self-regulation and restraint. I think we can now say, given the fervour over AutoGPT, that this will not happen, with high likelihood.
If we define “doom” as “some AI(s) take over the world suddenly without our consent, and then quickly kill everyone” then my p(doom) is in the single digits. (If we define it as human extinction or disempowerment more generally, regardless of the cause, then I have a higher probability, especially over very long time horizons.)
The scenario that I find most likely in the future looks like this:
Although AI gets progressively better at virtually all capabilities, there aren’t any further sudden jumps in general AI capabilities, much greater in size than the jump from GPT-3.5 to GPT-4. (Which isn’t to say that AI progress will be slow.)
As a result of (1), researchers roughly know what AI is capable of doing in the near-term future, and how it poses a risk to us during the relevant parts of the takeoff.
Before AI becomes dangerous enough to pose a takeover risk, politicians pass legislation regulating AI deployment. This legislation is wide reaching, and allows the government to extensively monitor compute resources and software.
AI labs spend considerable amounts of money (many billions of dollars) on AI safety to ensure they can pass government audits, and win the public’s trust. (Another possibility is that foundation model development is taken over by the government.)
People are generally very cautious about deploying AI in safety-critical situations. AI takes over important management roles only after people become very comfortable with the technology.
AI alignment is not an intractable problem, and SGD naturally finds human-friendly agents when rewarding models for good behavior in an exhaustive set of situations that they might reasonably expect to encounter. This is especially true when combined with whatever clever tricks we come up with in the future. While catastrophic forms of deception are compatible with the behavior we reward AIs for, it is usually simpler to just “be good” than to lie.
Even though some value misalignment slips through the cracks after auditing AIs using e.g. mechanistic interpretability tools, AI misalignment is typically slight rather than severe. This means that most AIs aren’t interested in killing all humans.
Eventually, competitive pressures force people to adopt AIs to automate just about every possible type of job, including management, giving AIs effective control of the world.
Humans retire as their wages fall to near zero. A large welfare state is constructed to pay income to people who did not own capital prior to the AI transition period. Even though inequality becomes very high after this transition, the vast majority of people become better off in material terms than the norm in 2023.
The world continues to evolve with AIs in control. Even though humans are retired, history has not yet ended.
I’d be curious about what happens after 10. How long do biological humans survive? How long can they be said to be “in control” of AI systems, such that some group of humans could change the direction of civilization if they wanted to? How likely is deliberate misuse of AI to cause an existential catastrophe, relative to slowly losing control of society? What are the positive visions of the future, and which are the most negative?
Seconding this. This future scenario as constructed seems brittle to subtle forms of misalignment[1] erasing nearly all future value (i.e. still an existential catastrophe even if not a sudden extinction event).
Note this seems somewhat similar to Yuval Harari’s worries voiced in Homo Deus.
Looks like Matthew did post a model of doom that contains something like this (back in May, before the top-level comment).
I think that given the possibility of brain emulation, the division between AIs and humans you are drawing here may not be so clear in the longer term. Does that play into your model at all, or do you expect that even human emulations with various cognitive upgrades will be totally unable to compete with pure AIs?
I don’t expect human brain emulations to be competitive with pure software AI. The main reason is that by the time we have the ability to simulate the human brain, I expect our AIs will already be better than humans at almost any cognitive task. We still haven’t simulated the simplest of organisms, and there are some good a priori reasons to think that software is easier to improve than brain emulation technology.
I definitely think we could try to merge with AIs to try to keep up with the pace of the world in general, but I don’t think this approach would allow us to surpass ordinary software progress.
I agree with you that pure software AGI is very likely to happen sooner than brain emulation.
I’m wondering about your scenario for the farther future, near the point when humans start to retire from all jobs. I think that at this point, many humans would be understandably afraid of the idea that AIs could take over. People are not stupid and many are obsessed with security. At this point, brain emulation would be possible. It seems to me that there would therefore be large efforts in making those emulations competitive with pure software AI in important ways (not all ways of course, but some important ones, involving things like judgment). Possibly involving regulation to aid this process. Of course it is just a guess, but it seems likely to me that this would work to some extent. However, this may stretch the definition of what we currently consider a human in some ways.
My estimates aren’t low, I think there’s very roughly about 40% chance we’ll die because of AI this century. But here are some reasons why it isn’t higher:
Creating a species vastly more intelligent than yourself seems highly unusual, nothing like it has happened before, so there need to be very good arguments for why it’s possible.
Having a species completely kill all other species is also very unusual, so there need to be very good arguments for why that would happen.
Perhaps AGI won’t be utility maximizing, LLMs don’t seem to be very maximizing. If it has a good model of the world maybe it’ll understand what we want and just give us that.
Perhaps we’ll convince the world to slow down AI capabilities research and solve alignment in time.
There are good counterarguments to these which is why my p(doom) is so high, but they still add to my uncertainty.
In response to point 2 - if you see human civilization continuing to develop indefinitely without regard for other species, wouldn’t other species be all extinct, except for maybe a select few?
“without regard for other species” is doing a lot of work
Other species are instrumentally very useful to humans, providing ecosystem functions, food, and sources of material (including genetic material).
On the AI side, it seems possible that a powerful misaligned AGI would find ecosystems and/or biological materials valuable, or that it would be cheaper to use humans for some tasks than machines. I think these factors would raise the odds that some humans (or human-adjacent engineered beings) survive in worlds dominated by such an AGI.
This seems pretty unlikely to me
People will continue to prefer controllable to uncontrollable AI and continue to make at least a commonsense level of investment in controllability; that is, they invest as much as naively warranted by recent experience and short term expectations, which is less than warranted by a sophisticated assessment of uncertainty about misalignment, though the two may converge as “recent experience” involves more and more capable AIs. I think this minimal level of investment in control is very likely (99%+).
Next, the proposed sudden/surprising phase transition that breaks controllability properties never materialises so that commonsense investment turns out to be enough for an OK outcome. I think this is about 65%.
Next, AI-enhanced human politics also manages to generate an OK outcome. About 70%.
That’s 45%, but the bar is perhaps higher than you have in mind (I’m also counting non misalignment paths to bad outcomes). There are also worlds where the problem is harder but more is invested & still ends up being enough. Not sure how much weight goes there, somewhat less.
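The chain of conditional estimates above can be checked with a quick back-of-envelope calculation (the three probabilities are the ones stated in the comment; treating the stages as independent is a simplifying assumption):

```python
# Rough conjunction of the three stages described above.
# Treating them as independent multiplicands is a simplifying assumption.
p_commonsense_investment = 0.99  # "commonsense" control investment happens
p_no_phase_transition = 0.65     # no sudden break in controllability
p_politics_ok = 0.70             # AI-enhanced politics yields an OK outcome

p_ok = p_commonsense_investment * p_no_phase_transition * p_politics_ok
print(f"P(OK outcome) ≈ {p_ok:.2f}")  # ≈ 0.45
```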
Don’t take these probabilities too seriously.
Thanks for answering! The only reason AI is currently controllable is that it is weaker than us. All the GPT-4 jailbreaks show how high the uncontrollability potential is, so I don’t think a phase transition is necessary, as we are still far from AI being controllable in the first place.
It cannot both be controllable because it’s weak and also uncontrollable.
That said, I expect more advanced techniques will be needed for more advanced AI; I just think control techniques probably keep up without sudden changes in control requirements.
Also, LLMs are more controllable than weaker, older designs (compare GPT-4 vs Tay).
Yes. This is no comfort for me in terms of p(doom|AGI). There will be sudden changes in control requirements, judging by the big leaps of capability between GPT generations.
More controllable is one thing, but it doesn’t really matter much for reducing x-risk when the numbers being talked about are “29%”.
That’s what I meant by “phase transition”
The framing here seems really strange to me. You seem to have a strong prior that doom happens, while, to me, most arguments for doom require quite a few hypotheses to be true, and hence their conjunction is a priori unlikely. I guess I don’t find the inside-view arguments persuasive enough to majorly update, much like the median AI expert, who is at around 2%.
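The conjunction point can be illustrated numerically (the hypothesis count and the per-hypothesis credences below are purely illustrative, not taken from any particular doom argument):

```python
# If a doom argument requires several hypotheses to all be true, the joint
# probability falls quickly even when each hypothesis is individually plausible.
# These credences are made-up numbers for illustration only.
credences = [0.8, 0.7, 0.7, 0.6, 0.5]

joint = 1.0
for p in credences:
    joint *= p  # assumes (unrealistically) full independence

print(f"joint probability ≈ {joint:.3f}")  # ≈ 0.118
```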
To go into your questions specifically.
AGI is closer to a very intelligent human than to a naive optimiser.
I don’t see why this is required, I’m not arguing p(doom) is 0.
AGI either can’t or “chooses” not to cause an x-risk.
What makes you think this? To me it just sounds like ungrounded anthropomorphism.
Sure, it’s uncertain. But we’re 100% of the reference class of general intelligences. Most AI scenarios seem to lean very heavily on the “naive optimiser”, though, which means low-ish credence in them a priori.
In reality, I guess both these views are wrong or it’s a spectrum with AGI somewhere along it.
We are a product of billions of years of biological evolution. AI shares none of that history in its architecture. Are all planets habitable like the Earth? Apply the Copernican Revolution to mindspace.
Yes, high uncertainty here. The problem is that your credence on AI being a strong optimiser is a ceiling on p(doom|AGI) under every scenario I’ve read.
What makes you think it’s unlikely that strong optimisers will come about?
Prior: most specific hypotheses are wrong. Update: we don’t have strong evidence in any direction. Conclusion: more likely than not this is wrong.
This seems like a case of different prior distributions. I think it’s a specific hypothesis to say that strong optimisers won’t happen (i.e. there has to be a specific reason for this, otherwise it’s the default, for convergent instrumental reasons).
Depends on what you mean by doom: I don’t think we’re getting paperclipped, but I don’t think things will be ok by default either.
I expect early AI agents to be purpose-built ensembles of bleeding-edge predictive models glued together by a bunch of crude hacks. Whatever gestalt entity emerges out of their internal dynamics is going to be only weakly agential: more like a corporation than a human being.
An individual agent like this could not defeat all of us together. A competitive ecosystem of them, however, can and will defeat all of us separately—Moloch made silicon. Most of us are going to make it, at least in the short term, but the window in which we might have steered human civilization towards human ends will shut forever.
By doom I mean there are no humans left. How does your scenario not lead to that eventually? Silicon Moloch is likely to foom, surely? (Cf. the “rapid dose of history” that will happen.) What’s to stop the GPT-6+AutoGPT+plugins agent-like economy developing GPT-7, and so on?
Eventually, it probably does. But the same is true of our current trajectory. I think the sort of AI we seem likely to get first raises the short run p(doom) somewhat, but primarily by intensifying existing risks.
In the limit of maximally intense competition (which we won’t see, but might approach), probably not. We’d get a thousand years worth of incremental software improvements in a decade and a (slower) takeoff in chip production, but not runaway technological progress: in principle the risk free rate will eventually drop to a rate where basic research becomes economical, but I expect the ascended economy will go the way of all feedback loops and undermine its own preconditions long before that.
What sort of civilization comes out the other end is anyone’s guess, but I doubt it’ll be less equipped to protect itself than we are.
And you’re saying this isn’t enough for foom? We’ve had ~50 years of software development so far and gone from 0 to GPT-4.
I expect that
Riding transformer scaling laws all the way to the end of the internet still only gets you something at most moderately superhuman. This would be civilization-of-immortal-geniuses dangerous, but not angry-alien-god dangerous: MAD is still in effect, for instance. No nanotech, no psychohistory.
In particular, they won’t be smart enough to determine whether an alternative architecture can go Foom a priori.
Foom candidates will not be many orders of magnitude cheaper to train than mature language models,
and that as a result the marginal return on trying to go Foom will be zero. If it happens, it’ll be the result of deliberate effort by an agent with lots and lots of slack to burn, not something that accidentally falls out of market dynamics.
and a 10,000,000-fold increase in transistor density. We might return to 20th century compute cost improvements for a bit, if things get really really cheap, but it’s not going to move anywhere near as fast as software.
What about the limits of data capture? There’s still many orders of magnitude more data that could be collected—imagine all the billions of cameras in the world recording video 24/7 for a start. Or the limits of data generation? There are already companies creating synthetic data for training ML models.
There’s probably at least another 100-fold hardware overhang in terms of under-utilised compute that could be immediately exploited by AI; much more if all GPUs/TPUs are consolidated for big training runs.
Also, you know those uncanny ads you get that are related to what you were just talking about? Google is likely already capturing more spoken words per day from phone mic recording than were used in the entirety of the GPT-4 training set (~10^12).
I can think of a few scenarios where AGI doesn’t kill us.
AGI does not act as a rational agent. The predicted doom scenarios rely on the AGI acting as a rational agent that maximises a utility function at all costs. This behaviour has not been seen in nature. Instead, all intelligences (natural or artificial) have some degree of laziness, which results in them being less destructive. Assuming the orthogonality thesis is true, this is unlikely to change.
The AGI sees humans as more useful alive than dead, probably because its utility function involves humans somehow. This covers a lot of scenarios, from horrible dystopias where the AGI tortures us constantly to see how we react, all the way to us actually somehow getting alignment right on the first try. It keeps us alive for the same reason we keep our pets alive.
The first A”G”Is are actually just a bunch of narrow AIs in a trenchcoat, and no one of them is able to overthrow humanity. A lot of recent advances in AI (including GPT-4) have been propelled by a move away from generality and towards a “mixture of experts” model, where complex tasks are split into simpler ones. If this scales, one could expect more advanced systems to still not be general enough to act autonomously in a way that overpowers humanity.
AGI can’t self improve because it runs face-first into the alignment problem! If we can think of how creating an intelligence greater than us results in the alignment problem, so can AGI. An AGI that fears creating something more powerful than itself will not do that, resulting in it remaining at around human level. Such an AGI would not be strong enough to beat all of humanity combined, so it will be smart enough not to try.
Species aren’t lazy (those who are—or would be—are outcompeted by those who aren’t).
The pets scenario is basically an existential catastrophe by other means (who wants to be a pet that is a caricature of a human like a pug is to a wolf?). And obviously so is the torture/dystopia one (i.e. not an “OK outcome”). What mechanism would allow us to get alignment right on the first try?
This seems like a very unstable equilibrium. All that is needed is for one of the experts to be as good as Ilya Sutskever at AI Engineering, to get past that bottleneck in short order (speed and millions of instances run at once) and foom to ASI.
It would also need to stop all other AGIs that are less cautious, and be ahead of them when self-improvement becomes possible. This seems unlikely given current race dynamics. And even if it does happen, unless that AGI was very aligned to humanity it still spells doom for us, due to its speed advantage and its different substrate needs (i.e. its ideal operating environment isn’t survivable for us).
Some discussion between me and Rohin Shah here (which hasn’t led me to shift my p(doom|AGI) yet).
I think there are many different ways things can go down that will all result in no doom. I’m still concerned that some of them could involve large amounts of collateral damage. Below I will sketch out 8 different scenarios that result in humanity surviving:
1. The most likely scenario for safety, in my mind, is that what we call an AGI will not live up to the hype. I.e., it will do some very impressive things, but will still retain significant flaws and make frequent mistakes, which render it incapable of world domination.
2. Similar to the above, it might be that conquering humanity becomes an impossibly difficult task, perhaps due to enhanced monitoring and anti-AI enforcement mechanisms.
3. Another scenario is that the first AGIs we build will not be malicious, and we use those AIs to monitor and defeat rogue AIs.
4. Another scenario is that using a bunch of hyper-specialised “narrow” AIs turns out to be better than a “general” AI, so research into the latter is essentially abandoned.
5. Another scenario is that solving alignment is a necessary step on the road to AGI. “Getting the computer to do what you want” is a key part of all programming; it may be the key question that needs to be solved to even get to “AGI”.
6. Another scenario is that solving alignment turns out to be surprisingly easy, so everyone just does it.
7. Another scenario is a very high frequency of warning shots. An AI does not need to be capable of winning to go rogue. It could be mistaken about its beliefs, or think that a small probability of success is “worth it”. A few high-profile disasters might be more than enough to get the world on board with banning AGI entirely.
8. Another scenario is that we don’t end up with enough compute to actually run an AGI, so it doesn’t happen.
I would bet there are plenty more possible scenarios out there.
Thanks.
1. The premise is that we have AGI (i.e. p(doom|AGI)).
2. What do the anti-AI enforcement mechanisms look like? How are they better than the military-grade cybersecurity that we currently have (that is hardly watertight)?
3. How is the AGI (spontaneously?) not dangerous? (Malicious is a loaded word here: ~all the danger comes from it being indifferent. “The AI does not hate you, nor does it love you, but you are made out of atoms which it can use for something else.”)
4. This seems highly unlikely given the success of GPT-4, and multimodal foundation models (expect “GATO 2” or similar from Google DeepMind in the next few months—a single model able to handle text, images, video and robotics).
5. I don’t see how that follows? Or at least how you get perfect (non-doom level) alignment. So far releases have happened when the alignment is “good enough”. But it is only good enough because the models are still weaker than us.
6. Why do people think this? Where are the major breakthroughs getting us closer to this? There is even a nascent field working on proofs for alignment’s theoretical impossibility.
7. I hope, if there are warning shots, that this happens! Ideally it should be at least Paused pre-emptively, to give alignment research time to catch up.
8. Seems unlikely. We’re probably in a hardware overhang situation already. The limit may well be just having enough money in one place to spend on the compute ($10B?).
No worries! Here are some responses to your responses.
This scenario is not saying AGI won’t exist, just that AGIs won’t be powerful enough to conquer humanity. As a simple counterexample, if you did a mind upload of a flat-Earther, that would count as AGI, but they would be too mentally flawed to kill us all.
This scenario would be most likely if an AI plan to conquer the world were highly difficult and required amassing large amounts of resources and power. (I suspect this is the case.) In that case, such power could be opposed effectively via conventional means, the same way we prevent rogue states such as North Korea from dominating the world.
Perhaps I should have used a different word than “alignment” here. I expect it to be potentially dangerous, but not world-ending. I’m working on a post about this, but turning every atom towards a single goal is characteristic of a fanatical global maximiser, which does not describe neural networks or any existing intelligence, and is as such ridiculously unlikely. All that is required is that the AI is aligned enough with us to not want to commit global genocide.
Compare the performance of GPT-4 when playing chess to that of Stockfish. Stockfish is massively superior to any human being, while GPT-4 still can’t even learn the rules, despite having access to the same computing power and data. Since specialised AIs are superior now, it’s quite possible they will continue to be in the future.
I don’t know what the architecture of future AI will look like, so it’s hard to speculate on this point. But misalignment definitely hurts the goals of AI companies. A premature rogue AI that kills lots of people would get companies sued into oblivion.
Again, I want to emphasise that there is a large gulf between “AI shares literally all of our values and never harms anyone” and “AI is aligned enough to not commit global genocide of all humanity”. I find it incredibly hard to believe that the latter is “theoretically impossible”, although I would be unsurprised if the former is.
People are already primed to be distrustful of AI and the people making it, so global regulation of some kind on AI is probably inevitable. The severity will depend on the severity of “warning shots”.
I just wouldn’t be sure on this one. I don’t think current designs are anywhere close to AGI, and I don’t think we know enough about the requirements for AGI to state confidently that we are in an overhang.
I wouldn’t say I’m confident in all of these scenarios being likely, but I’m virtually certain that at least one of them is, or another one I haven’t thought of yet. In general, I believe that the capabilities of AGIs are being drastically overstated, and they will turn out to have flaws, like literally every technology ever invented. I also believe that conquering humanity is a ridiculously difficult task, and probably fairly easy to foil.
A mind upload of a flat-Earther might not stay a flat-Earther for long if they are given access to all the world’s sensors (including orbiting cameras), and had 1000s of subjective years to think every month.
North Korean hackers have stolen billions of dollars. Imagine if there were a million times more of them. And that is mere human-level we’re talking about.
How do you get it aligned enough to not want to commit global genocide? Sounds like you’ve solved 99% of the alignment problem if you can do that!
I thought GPT-4 had learned the rules of chess? Pretty impressive for just being trained on text (it shows it has emergent internal world models).
I think you can assume that the architecture is basically “foundation transformer model / LLM” at this point. As Connor Leahy says, they are basically “general cognition engines” and will scale to full AGI in a generation or two (and with the addition of various plugins etc. to aid “System 2”-type thinking, which are now freely being offered by the AutoGPT enthusiasts and OpenAI). We may or may not get such warning shots (look out for Google DeepMind’s next multimodal model, I guess...)
I don’t think there’s as much of a gulf as there appears on the face of it. I think you are anthropomorphising to think that it will care about scale of harms in such a way (without being perfectly aligned). See also: Mesa-optimisation leading to value drift. The AI needs to be aligned on not committing global genocide indefinitely.
Hope so! And hope we don’t need any (more) lethal warning shots for it to happen. My worry is that we have very little time to get the regulation in place (hence working to try and speed that up).
I’m not sure, but I’m sure enough to be really concerned! What about the current architecture (“general cognition engine”) + AutoGPT + plugins: isn’t that enough? And 100x GPT-4’s compute costs <1% of Google or Microsoft’s market capitalisation.
Even if you don’t think x-risk is likely, if you think a global catastrophe still is, I hope you can get behind calls for regulation.
Okay. But what then stops someone from creating an AGI that does? The game doesn’t end after one turn.
This seems extremely unlikely. In terms of things I can think of, if I were an AGI, I’d just infect every device with malware, and then control the entire flow of information. This could easily be done by reality warping humans (giving them a false feed of information that completely distorts their model of reality) or simply shutting them out—humans can’t coordinate if they can’t communicate on a mass scale. This would be really, really easy. Human society is incredibly fragile. We only don’t realize this because we’ve never had a concerted, competent attempt to actually break it.
This sounds like summoning Godzilla to fight MechaGodzilla: https://www.lesswrong.com/posts/DwqgLXn5qYC7GqExF/godzilla-strategies
The current trend is that generalization is superior, and will probably continue to be so.
With the way current models are designed, this seems extremely unlikely.
This also seems unlikely, given how many have tried, and we’re still nowhere close to solving it.
This is the most plausible of these. But the scale of the tragedy might be extremely high. Breakdown of communication, loss of supply chains, mass starvations, economic collapse, international warfare, etc. Even if it’s not extinction, I’m not sure how many shocks current civilization can endure.
This could maybe slow it, but not for long. I imagine there are far more efficient ways of running an AGI that someone would learn to implement.
Have you heard of this thing called “all of human history before the computer age”? Human coordination and civilization do not require hackable devices to operate. This plan would be exposed in 10 seconds flat as soon as people started talking to each other and comparing notes.
In general, I think that the main issue is a ridiculously overinflated idea of what “AGI” will actually be capable of. When something doesn’t exist yet, it’s easy to imagine it as having no flaws. But that invariably never turns out to be the case.
Yeah, once upon a time, but now our civilization is interconnected and dependent on the computer age. And they wouldn’t even have to realize they needed to coordinate.
How would they do that? They’ve lost control of communication. Expose it to who?
I am at high P(doom|AGI pre-2035), but not at near-certainty. Say, 75% but not 99.9%.
The reason for that is that I find both “fast takeoff takeover” and “continuous multipolar takeoff” scenarios plausible (with no decisive evidence for one or the other). In “continuous multipolar takeoff”, you still get superintelligences running around. However, they would be “superintelligent with respect to civilization-2023” but not necessarily with respect to civilization-then. And for the standard somewhat-well-thought-out AI takeover arguments to apply, you need to be superintelligent with respect to civilization-then.
Two disclaimers: (1) Just because you don’t get a discontinuity in influence around human level does not mean you can’t get it later. In my book, the world can look “Christiano-like” until suddenly it looks “Yudkowsky-like”. (2) Even if we never get an AI singleton, things can still go horribly wrong (i.e., Christiano’s “What failure looks like”). But imo those scenarios are much harder to reason about, and we haven’t thought them out in enough detail to justify high certainty of either outcome.
My intuitive aggregation of this gives, say, 80% P(doom this century|AGI pre-2035). On top of that, I add some 5-10% on “I am so wrong about some of this that even the high-level reasoning doesn’t apply”. (Which includes being wrong about where the burden of proof, and the priors, lie for P(doom|AGI).) And that puts me at the (ass-)number 75%.
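This aggregation can be sketched numerically (applying the 5-10% as a multiplicative discount on the inside-view estimate is my reading of the comment, not something it spells out):

```python
# Sketch of the aggregation described above: start from an 80% inside-view
# estimate, then discount by 5-10% for "my whole framing could be wrong".
# Exactly how the discount should be applied is an assumption on my part.
p_inside_view = 0.80

for model_error in (0.05, 0.10):
    adjusted = p_inside_view * (1 - model_error)
    print(f"model error {model_error:.0%}: adjusted P(doom) ≈ {adjusted:.2f}")
# The two adjusted values (0.76 and 0.72) bracket the stated ~75%.
```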
This is a scenario that I haven’t seen discussed on the EA Forum yet: AI consciousness (with a possible mechanism for it). In his recent podcast with Lex Fridman, Max Tegmark speculates that recurrent neural networks (RNNs) could be a source of consciousness (whereas the linear, feed-forward architecture of the currently dominant LLMs isn’t). I’m not sure if this helps with us being fine, as the consciousness could have very negative valence (so it hates us for bringing it into being), but this idea seems to me more of a way out of doom-by-default (given AGI) than anything else I’ve seen so far.
I believe that there’s more uncertainty about the future than there was previously.
This means that
(a) it’s hard for me to commit to a doom outcome with high confidence
(b) it’s hard for me to commit to any outcome with high confidence
(c) even if I think that doom has <10% chance of happening, it doesn’t mean I can articulate what the rest of the probability space looks like.
To be clear, I think that someone with this set of beliefs, including 1% chance of doom, should be highly concerned and should want action to be taken to keep everyone safe from the risks of AI.
This reminds me a bit of Tyler Cowen’s take (but glad for your last paragraph!). I think Scott Alexander’s response to Cowen is good.
I agree with Tyler Cowen that it’s hard to predict what will happen, although my argument has a (not mega-important) nuance that his blog post doesn’t have: that the difficulty of prediction is increasing.
A (more important) difference is that I don’t commit what Scott Alexander calls the Safe Uncertainty Fallacy. I’ve encountered that argument a lot with climate sceptics for many years, and have found it infuriating how it’s simultaneously a very bad argument and yet can be made to sound sensible.
A key part of my model right now relies on who develops the first AGI and on how many AGIs are developed.
If the first AGI is developed by OpenAI, Google DeepMind, or Anthropic (all of whom seem relatively cautious, perhaps some more than others), I put the chance of massively catastrophic misalignment at <20%.
If one of those labs is first and somehow able to prevent other actors from creating AGI after this, then that leaves my overall massively catastrophic misalignment risk at <20%. However, while I think it’s likely one of these labs would be first, I’m highly uncertain about whether they would achieve the pivotal outcome of preventing subsequent AGIs.
So, if some less cautious actor overtakes the leading labs, or if the leading lab that first develops AGI cannot prevent many others from building AGI afterward, I see a much higher likelihood of massively catastrophic misalignment from one of these attempts. In this scenario, my overall massively catastrophic misalignment risk is definitely >50%, perhaps closer to the 75-90% range.
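Put as arithmetic, the model above is a small decision tree. The branch probabilities below are placeholders of my choosing; only the 20% and 75% conditionals echo the text:

```python
# Decision-tree sketch of the misalignment-risk model above.
# Branch probabilities are illustrative assumptions, not claims from the comment.
p_cautious_lab_first = 0.7       # placeholder: OpenAI/GDM/Anthropic get there first
p_pivotal_given_first = 0.3      # placeholder: and they prevent subsequent AGIs

p_safe_branch = p_cautious_lab_first * p_pivotal_given_first
p_risky_branch = 1 - p_safe_branch

risk_safe_branch = 0.20          # the "<20%" figure, used as an upper bound
risk_risky_branch = 0.75         # lower end of the ">50%, perhaps 75-90%" range

overall = (p_safe_branch * risk_safe_branch
           + p_risky_branch * risk_risky_branch)
print(round(overall, 3))
```

With these placeholder inputs the overall risk lands above 50%, which is why the conclusion is so sensitive to whether the pivotal "prevent subsequent AGIs" branch is actually reachable.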
Thanks!
Interested in hearing more about why you think this. How do they go from the current level of really poor alignment (see ref to “29%” here, and all the jailbreaks, and consider that the current models are only relatively safe because they are weak), to perfect alignment? How does their alignment scale? How is “getting the AI to do your alignment homework” even a remotely safe strategy for eliminating catastrophic risk?
I place significant weight on the possibility that when labs are in the process of training AGI or near-AGI systems, they will be able to see alignment opportunities that we can’t from a more theoretical or distanced POV. In this sense, I’m sympathetic to Anthropic’s empirical approach to safety. I also think there are a lot of really smart and creative people working at these labs.
Leading labs also employ some people focused on the worst risks. For misalignment risks, I am most worried about deceptive alignment, and Anthropic recently hired one of the people who coined that term. (From this angle, I would feel safer about these risks if Anthropic were in the lead rather than OpenAI. I know less about OpenAI’s current alignment team.)
Let me be clear though: Even if I’m right above and the massively catastrophic misalignment risk from one of these labs creating AGI is ~20%, I consider that very much an unacceptably high risk. I think even a 1% chance of extinction is unacceptably high. If some other kind of project had a 1% chance of causing human extinction, I don’t think the public would stand for it. Imagine some particle accelerator or biotech project had a 1% chance of causing human extinction. If the public found out, I think they would want the project shut down immediately until it could be pursued safely. And I think they would be justified in that, if there’s a way to coordinate on doing so.
This just seems like a hell of a reckless gamble to me. And you have to factor in their massive profit-making motivation. Is this really much more than mere safetywashing?
Or, y’know, you could just not build them and avoid the serious safety concerns that way?
Wow. It’s like they are just agreeing with the people who say we need empirical evidence for x-risk, and are fine with offering it (with no democratic mandate to do so!)
Thanks for your last paragraph. Very much agree.
Great question! I wrote a (draft) post kind of answering this recently, basically arguing that even though an AGI that is developed would converge on some instrumental goals, it would likely face powerful competing motivations against disempowering humanity/causing extinction. Note that my argument only applies to AGI ‘misalignment/accident’ risks, and doesn’t rule out AGI misuse risks.
Would love to hear your view on the post!
Thanks! I’ve commented on your post. I think you are assuming that major unsolved problems in alignment (reward hacking, corrigibility, inner alignment, outer alignment) are just somehow magically solved (it reads as though you are unaware of what the major problems are in AI alignment, sorry).