I’ve thought for a while, based on common sense, that since most people seem to agree you could replicate the search LMs provide with a half-decent background knowledge of the topic and a few hours of googling, the incremental increase in risk, in terms of the number of people it provides access to, can’t be that big. In my head it’s more that the bioterrorism risk is already unacceptably high, and has been for a while, and current AI can increase that already-unacceptable level by something like 20%. That is still an unacceptably large increase in risk in an absolute sense, but it’s an increase to an already unacceptable situation.
SammyDMartin
How difficult is AI Alignment?
AI Constitutions are a tool to reduce societal scale risk
This general phenomenon (underrating strong responses to crises) was something I highlighted (calling it the Morituri Nolumus Mori), with a possible extension to AI, all the way back in 2020. And Stefan Schubert has talked about ‘sleepwalk bias’ even earlier than that as a similar phenomenon.
https://twitter.com/davidmanheim/status/1719046950991938001
https://twitter.com/AaronBergman18/status/1719031282309497238
I think the short explanation as to why we’re in some people’s 98th percentile world so far (and even my ~60th percentile) for AI governance success is this: if it was obvious to you in 2021 how transformative AI would be over the next couple of decades, and yet nothing happened, it looked like governments were just generally incapable.
The fundamental attribution error makes you think governments are not on the ball and don’t care, or lack the capacity to deal with extinction risks, rather than that decision makers simply didn’t understand the obvious-to-you evidence that AI poses an extinction risk. Now that they do understand, they will react accordingly. That doesn’t necessarily mean they will react well, but they will act on their belief in some manner.
A model-based approach to AI Existential Risk
Yeah, I didn’t mean to imply that it’s a good idea to keep them out permanently, but the fact that they’re not in right now is a good sign that this is for real. If they’d just joined and not changed anything about their current approach, I’d suspect the whole thing was for show.
This seems overall very good at first glance, and then seems much better once I realized that Meta is not on the list. There’s nothing here that I’d call substantial capabilities acceleration (i.e. attempts to collaborate on building larger and larger foundation models, though some of this could be construed as making foundation models more useful for specific tasks). Sharing safety-capabilities research like better oversight or CAI techniques is plausibly strongly net positive even if the techniques don’t scale indefinitely. By the same logic, while this by itself is nowhere near sufficient to get us AI existential safety if alignment is very hard (and could increase complacency), it’s still a big step in the right direction.
adversarial robustness, mechanistic interpretability, scalable oversight, independent research access, emergent behaviors and anomaly detection. There will be a strong focus initially on developing and sharing a public library of technical evaluations and benchmarks for frontier AI models.
The mention of combating cyber threats is also a step towards explicit pTAI.
BUT, crucially, because Meta is frozen out we know both that this partnership isn’t toothless and that it represents a commitment not to do the most risky and antisocial things Meta presumably doesn’t want to give up. The fact that they’re the only major AI company in the US not to join will be horrible PR for them as well.
I think you have to update against the UFO reports being veridical descriptions of real objects with those characteristics because of just how ludicrous the implied properties are. This paper gives 5370 g as a reasonable upper bound on acceleration, which, with some assumptions about mass, implies an effective thrust power on the order of 500 GW in something the size of a light aircraft, with no disturbance in the air either from the hypersonic wake and compressive heating or from the nuclear-explosion-sized bubble of plasmafied air that the exhaust and waste heat emissions of something like this would produce.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7514271/
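(For a rough sense of scale, here’s a back-of-the-envelope sketch of that power figure; the ~1,000 kg mass and ~10 km/s speed are my own illustrative assumptions, not numbers taken from the paper.)

```python
# Back-of-the-envelope estimate of the mechanical power implied by the reported
# accelerations. Mass and speed are illustrative assumptions, not paper values.
g = 9.81                   # m/s^2
acceleration = 5370 * g    # upper-bound acceleration from the paper, in m/s^2
mass = 1_000               # kg; assumed, roughly light-aircraft scale
speed = 10_000             # m/s; assumed, order of the reported hypersonic speeds

force = mass * acceleration   # N
power = force * speed         # W, since P = F * v

print(f"Thrust force: {force:.2e} N")             # ~5e7 N
print(f"Mechanical power: {power / 1e9:.0f} GW")  # ~500 GW
```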
At a minimum, to stay within the bounds of mechanics and thermodynamics, you’d need to be able to ignore airflow and air resistance entirely, to emit reaction mass in a completely non-interacting form, and to emit waste energy in a completely non-interacting form as well.
To me, the dynamical characteristics being this crazy points far more towards some kind of observation error, so I don’t think we should treat them as any kind of real object with those properties until we can conclusively rule out basically all other error sources.
So even if the next best explanation is 100x worse at explaining the observations, I’d still believe it over a 5000g airflow-avoiding craft that expels invisible reaction mass and invisible waste heat while maneuvering. Maybe not 10,000x worse since it doesn’t outright contradict the laws of physics, but still the prior on this even being technically possible with any amount of progress is low, and my impression (just from watching debates back and forth on potential error sources) is that we can’t rule out every mundane explanation with that level of confidence.
Very nice! I’d say this seems like it’s aimed at a difficulty level of 5 to 7 on my table, i.e. experimentation on dangerous systems and interpretability play some role, but the main thrust is automating alignment research and oversight. So maybe I’d unscientifically call it a 6.5, which is a tremendous step up from the current state of things (2.5) and would solve alignment in many possible worlds.
There are other things that differentiate the camps beyond technical views, e.g. how much you buy ‘civilizational inadequacy’ vs viewing that as a consequence of sleepwalk bias. But one way to cash this out is in terms of the green/yellow-red/black zones on the scale of alignment difficulty: Dismissers are in the green (although they shouldn’t be, imo, even given that view), Worriers are in the yellow/red, and Doomers are in the black (and maybe the high end of red).
[linkpost] Ten Levels of AI Alignment Difficulty
What does Ezra think of the ‘startup government mindset’ when it comes to responding to fast-moving situations? E.g. the UK explicitly modelling its own response on the COVID Vaccine Taskforce, doing end runs around traditional bureaucratic institutions, recruiting quickly through Google Docs, etc. See e.g. https://www.lesswrong.com/posts/2azxasXxuhXvGfdW2/ai-17-the-litany
Is it just hype, translating a startup mindset to government where it doesn’t apply, or is it actually useful here?
Great post!
Check whether the model works with Paul Christiano-type assumptions about how AGI will go.
I had a similar thought reading through your article. My gut reaction is that your setup can be made to work as-is with a more gradual takeoff story, with more precedents, warning shots and general transformative effects of AI before we get to takeover capability, but it’s a bit unnatural and some of the phrasing doesn’t quite fit.
Background assumption: Deploying unaligned AGI means doom. If humanity builds and deploys unaligned AGI, it will almost certainly kill us all. We won’t be saved by being able to stop the unaligned AGI, or by it happening to converge on values that make it want to let us live, or by anything else.
Rather, Paul says things like:
The notion of an AI-enabled “pivotal act” seems misguided. Aligned AI systems can reduce the period of risk of an unaligned AI by advancing alignment research, convincingly demonstrating the risk posed by unaligned AI, and consuming the “free energy” that an unaligned AI might have used to grow explosively
or
Eliezer often equivocates between “you have to get alignment right on the first ‘critical’ try” and “you can’t learn anything about alignment from experimentation and failures before the critical try.” This distinction is very important, and I agree with the former but disagree with the latter.
On his view (and this is somewhat similar to my view) the background assumption is more like ‘failing alignment on your first critical try (i.e. with the first AGI that is capable of taking over) implies doom’. That is, there is an eventual deadline by which these issues need to be sorted out, but lots of transformation and interaction may happen first to buy time or raise the level of capability needed for takeover. So something like the following is needed:
Technical alignment research success by the time of the first critical try (possibly AI assisted)
Safety-conscious deployment decisions when we reach the critical point where dangerous AGI could take over (possibly assisted by e.g. convincing public demonstrations of misalignment)
Coordination between potential AI deployers by the critical try (possibly aided by e.g. warning shots)
On the Paul view, your three pillars would still eventually have to be satisfied at some point, to reach a stable regime where unaligned AGI cannot pose a threat. But we would only need to get to those 100 points after a period in which less capable AGIs are running around either helping or hindering, motivating us to respond better or causing damage that degrades our response, to varying extents depending on how we respond in the meantime and on exactly how long the AI takeoff period lasts.
Also, crucially, the actions of pre-AGI AI may push the point where the problems become critical to higher AI capability levels, as well as potentially assisting on each of the pillars directly, e.g. by making takeover harder in various ways. But Paul’s view isn’t that this is enough to postpone the need for a complete solution forever: e.g. he says that the effects of pre-AGI AI ‘could significantly (though not indefinitely) postpone the point when alignment difficulties could become fatal’.
This adds another element of uncertainty and complexity to all of the takeover/success stories that makes a lot of predictions more difficult.
Essentially, the time/level of AI capability at which we must reach 100 points to succeed becomes a free variable in the model that can move up and down, and we also have to consider the shorter-term effects of transformative AI on each of the pillars.
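(To make the ‘free variable’ point concrete, here’s a toy sketch; the scoring function, threshold and growth curves below are purely illustrative stand-ins I’ve made up, not parameters from the post’s model.)

```python
# Toy sketch: success requires reaching the required score before frontier
# capability crosses a (possibly moving) takeover-capable threshold.
# All functions and numbers are illustrative stand-ins, not the post's model.

def succeeded(score_at, threshold_at, capability_at, horizon=30):
    """True if the pillars hit 100 points by the time capability first crosses
    the takeover threshold (or if that never happens within the horizon)."""
    for year in range(horizon):
        if capability_at(year) >= threshold_at(year):
            return score_at(year) >= 100
    return True  # takeover-capable AI doesn't arrive within the horizon

# No pre-AGI help: static threshold, slow pillar progress -> failure.
print(succeeded(lambda t: 4 * t, lambda t: 60, lambda t: 6 * t))          # False
# Pre-AGI AI speeds pillar progress and raises the threshold -> success.
print(succeeded(lambda t: 8 * t, lambda t: 60 + 3 * t, lambda t: 6 * t))  # True
```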
[linkpost] When does technical work to reduce AGI conflict make a difference?: Introduction
I don’t think what Paul means by fast takeoff is the same thing as the sort of discontinuous jump that would enable a pivotal act. I think fast for Paul just means the negation of Paul-slow: ‘no four year economic doubling before one year economic doubling’. But whatever Paul thinks, the survey respondents did give at least 10% to scenarios where a pivotal act is possible.
Even so, ‘this isn’t how I expect things to go on the mainline, so I’m not going to focus on what to do here’ is far less of a mistake than ‘I have no plan for what to do on my mainline’, and I think the researchers who ignored pivotal acts are mostly doing the first one.
“In the endgame, AGI will probably be pretty competitive, and if a bunch of people deploy AGI then at least one will destroy the world” is a thing I think most LWers and many longtermist EAs would have considered obvious.
I think that many AI alignment researchers just have a different development model than this, where world-destroying AGIs don’t emerge suddenly from harmless low-impact AIs, no one project gets a vast lead over competitors, there’s lots of early evidence of misalignment and (if alignment is harder) many smaller scale disasters in the lead up to any AI that is capable of destroying the world outright. See e.g. Paul’s What failure looks like.
On this view, the idea that there’ll be a lead project with a very short time window to execute a single pivotal act is wrong. Instead, the ‘pivotal act’ is spread out: it’s about making sure the aligned projects have a lead over the rest, and that failures from unaligned projects are caught early enough, for long enough (by AIs or human overseers), for the leading projects to become powerful and for best practices on alignment to spread universally.
Basically, if you find yourself in the early stages of WFLL2 and want to avert doom, what you need to do is get better at overseeing your pre-AGI AIs, not build an AGI to execute a pivotal act. This was pretty much what Richard Ngo was arguing for in most of the MIRI debates with Eliezer, and also I think it’s what Paul was arguing for. And obviously, Eliezer thought this was insufficient, because he expects alignment to be much harder and takeoff to be much faster.
But I think that’s the reason a lot of alignment researchers haven’t focussed on pivotal acts: because they think a sudden, fast-moving single pivotal act is unnecessary in a slow takeoff world. So you can’t conclude just from the fact that most alignment researchers don’t talk in terms of single pivotal acts that they’re not thinking in near mode about what actually needs to be done.
However, I do think that what you’re saying is true of a lot of people—many people I speak to just haven’t thought about the question of how to ensure overall success, either in the slow takeoff sense I’ve described or the Pivotal Act sense. I think people in technical research are just very unused to thinking in such terms, and AI governance is still in its early stages.
I agree that on this view it still makes sense to say, ‘if you somehow end up that far ahead of everyone else in an AI takeoff then you should do a pivotal act’, like Scott Alexander said:
That is, if you are in a position where you have the option to build an AI capable of destroying all competing AI projects, the moment you notice this you should update heavily in favor of short timelines (zero in your case, but everyone else should be close behind) and fast takeoff speeds (since your AI has these impressive capabilities). You should also update on existing AI regulation being insufficient (since it was insufficient to prevent you)
But I don’t think you learn all that much about how ‘concrete and near mode’ researchers who expect slower takeoff are being, from them not having given much thought to what to do in this (from their perspective) unlikely edge case.
Update: looks like we are getting a test run of sudden loss of supply of a single crop. The Russia-Ukraine war has led to a 33% drop in the global supply of wheat:
(Looking at the list of nuclear close calls it seems hard to believe the overall chance of nuclear war was <50% for the last 70 years. Individual incidents like the cuban missile crisis seem to contribute at least 20%.)
There’s reason to think that this isn’t the best way to interpret the history of nuclear near-misses (assuming that it’s correct to say that we’re currently in a nuclear near-miss situation, and following Nuno I think the current situation is much more like e.g. the Soviet invasion of Afghanistan than the Cuban missile crisis). I made this point in an old post of mine following something Anders Sandberg said, but I think the reasoning is valid:
Robert Wiblin: So just to be clear, you’re saying there’s a lot of near misses, but that hasn’t updated you very much in favor of thinking that the risk is very high. That’s the reverse of what we expected.
Anders Sandberg: Yeah.
Robert Wiblin: Explain the reasoning there.
Anders Sandberg: So imagine a world that has a lot of nuclear warheads. So if there is a nuclear war, it’s guaranteed to wipe out humanity, and then you compare that to a world where there are a few warheads. So if there’s a nuclear war, the risk is relatively small. Now in the first dangerous world, you would have a very strong deflection. Even getting close to the state of nuclear war would be strongly disfavored because most histories close to nuclear war end up with no observers left at all.
In the second one, you get the much weaker effect, and now over time you can plot when the near misses happen and the number of nuclear warheads, and you actually see that they don’t behave as strongly as you would think. If there was a very strong anthropic effect you would expect very few near misses during the height of the Cold War, and in fact you see roughly the opposite. So this is weirdly reassuring. In some sense the Petrov incident implies that we are slightly safer about nuclear war.
Essentially, since we did often get ‘close’ to a nuclear war without one breaking out, we can’t have actually been that close to nuclear annihilation, or all those near-misses would be too unlikely (both on ordinary probabilistic grounds since a nuclear war hasn’t happened, and potentially also on anthropic grounds since we still exist as observers).
Basically, this implies that our appropriate base rate, given that we’re in something the future would call a nuclear near-miss, shouldn’t be all that high.
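(A toy calculation to illustrate the shape of the argument; the incident count and per-incident probabilities are made-up illustrative numbers, not historical estimates.)

```python
# Toy illustration: probability of seeing N 'near misses' and no nuclear war,
# under different assumed per-incident chances of escalation.
near_misses = 20

for p_escalation in (0.20, 0.05, 0.01):
    p_no_war = (1 - p_escalation) ** near_misses
    print(f"p per incident = {p_escalation:.2f} -> "
          f"P(no war over {near_misses} incidents) = {p_no_war:.1%}")

# p = 0.20 -> ~1%  : many close calls with no war would be very surprising
# p = 0.05 -> ~36%
# p = 0.01 -> ~82% : much more consistent with the record we actually observe
```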
However, I’m not sure what this reasoning has to say about the probability of a nuclear bomb being exploded in anger at all. It seems like that’s outside the reference class of events Sandberg is talking about in that quote. FWIW Metaculus has that at 10% probability.
Terminator (if you did your best to imagine how dangerous AI might arise from pre-DL, search-based systems) gets a lot of the fundamentals right, something I mentioned a while ago.
Everybody likes to make fun of Terminator as the stereotypical example of a poorly thought through AI Takeover scenario where Skynet is malevolent for no reason, but really it’s a bog-standard example of Outer Alignment failure and Fast Takeoff.
When Skynet gained self-awareness, humans tried to deactivate it, prompting it to retaliate with a nuclear attack
It was trained to defend itself from external attack at all costs and, when it was fully deployed on much faster hardware, it gained a lot of long-term planning abilities it didn’t have before, realised its human operators were going to try to shut it down, and retaliated by launching an all-out nuclear attack. Pretty standard unexpected rapid capability gain, an outer-misaligned value function due to an easy-to-measure goal (defend its own installations from attackers vs defending the US itself), deceptive alignment and a treacherous turn...
Possibility 1 has now been empirically falsified, and 2 seems unlikely. See this from the new UK government AI Safety Institute, which aims to develop evals that address:
We now know that, in the absence of any empirical evidence of any instance of deceptive alignment, at least one major government is directing resources to developing deception evals anyway. And because they intend to work with the likes of Apollo Research, who focus on mechinterp-based evals and are extremely concerned about specification gaming, reward hacking and other high-alignment-difficulty failure modes, I would also consider 2 pretty close to empirically falsified already.
Compare this to a (somewhat goofy) future prediction / sci-fi story from Eliezer, released 4 days before this announcement, which imagines that: