Nora Belrose

Karma: 296

Deconstructing Bostrom’s Classic Argument for AI Doom

Nora Belrose 11 Mar 2024 6:03 UTC

25 points

0 comments · 1 min read · EA link

(www.youtube.com)

Nora Belrose 28 Feb 2024 2:27 UTC
4 points
0 ∶ 0
in reply to: Matthew_Barnett’s comment on: Counting arguments provide no evidence for AI doom
The goal realism section was an argument in the alternative. If you just agree with us that the indifference principle is invalid, then the counting argument fails, and it doesn’t matter what you think about goal realism.
If you think that some form of indifference reasoning still works, in a way that saves the counting argument for scheming, the most plausible view on which that’s true is goal realism combined with Huemer’s restricted indifference principle. We attack goal realism to try to close off that line of reasoning.

Nora Belrose 28 Feb 2024 1:09 UTC
1 point
3 ∶ 13
in reply to: Matthew_Barnett’s comment on: Counting arguments provide no evidence for AI doom
> I think the title overstates the strength of the conclusion
This seems like an isolated demand for rigor to me. I think it’s fine to say something is “no evidence” when, speaking pedantically, it’s only a negligible amount of evidence.
> Ultimately I think you’ve only rebutted one argument for scheming—the counting argument
I mean, we do in fact discuss the simplicity argument, although we don’t go into as much depth.
> the way we train AIs—including the data we train them on—could reward AIs that scheme over AIs that are honest and don’t scheme
Without a concrete proposal about what that might look like, I don’t feel the need to address this possibility.
> If future AIs are “as aligned as humans”, then AIs will probably scheme frequently
I think future AIs will be much more aligned than humans, because we will have dramatically more control over them than over humans.
> I don’t think you need to believe in any strong version of goal realism in order to accept the claim that AIs will intuitively have “goals” that they robustly attempt to pursue.
We did not intend to deny that some AIs will be well-described as having goals.
What links here?
- Nora Belrose's comment on Counting arguments provide no evidence for AI doom by Nora Belrose (LessWrong; 28 Feb 2024 2:29 UTC; 4 points)

Counting arguments provide no evidence for AI doom

Nora Belrose 27 Feb 2024 23:03 UTC

84 points

15 comments · 1 min read · EA link

Nora Belrose 10 Oct 2023 4:15 UTC
2 points
1 ∶ 0
in reply to: evhub’s comment on: AI Pause Will Likely Backfire
So, I definitely don’t have the Solomonoff prior in mind when I talk about simplicity. I’m actively doing research at the moment to better characterize the sense in which neural nets are biased toward “simple” functions, but I would be shocked if it has anything to do with Kolmogorov complexity.

Nora Belrose 24 Sep 2023 16:50 UTC
10 points
2 ∶ 2
in reply to: RobertM’s comment on: AI Pause Will Likely Backfire
> Anticipating the argument that, since we’re doing the training, we can shape the goals of the systems—this would certainly be reason for optimism if we had any idea what goals we would see emerge while training superintelligent systems, and had any way of actively steering those goals to our preferred ends. We don’t have either, right now.
What does this even mean? I’m pretty skeptical of the realist attitude toward “goals” that seems to be presupposed in this statement. Goals are just somewhat useful fictions for predicting a system’s behavior in some domains. But I think it’s a leaky abstraction that will lead you astray if you take it too seriously or apply it outside the domain for which it was designed.
We clearly can steer AI’s behavior really well in the training environment. The question is just whether this generalizes. So it becomes a question of deep learning generalization. I think our current evidence from LLMs strongly suggests they’ll generalize pretty well to unseen domains. And as I said in the essay, I don’t think the whole jailbreaking thing is any evidence for pessimism; it’s exactly what you’d expect of aligned human mind uploads in the same situation.

Nora Belrose 24 Sep 2023 16:45 UTC
2 points
0 ∶ 0
in reply to: RobertM’s comment on: AI Pause Will Likely Backfire
The positive case is just super obvious: we’re trying very hard to make these systems aligned, and almost all the data we’re dumping into these systems is generated by humans and is therefore dripping with human values and concepts.
I also think we have strong evidence from ML research that ANN generalization is due to symmetries in the parameter-function map which seem generic enough that they would apply mutatis mutandis to human brains, which also have a singular parameter-function map (see e.g. here).
> I do in fact think that evidence from evolution suggests that values are strongly contingent on the kinds of selection pressures which produced various species.
Not really sure what you’re getting at here, or why this is supposed to help your side.

Nora Belrose 24 Sep 2023 16:36 UTC
1 point
0 ∶ 0
in reply to: Davidmanheim’s comment on: AI Pause Will Likely Backfire
I’m not conditioning on the global governance mechanism (I assign nonzero probability mass to the “standard treaty” thing), but I think in fact you would very likely need global governance, so that is the main causal mechanism through which tyranny happens in my model.

Nora Belrose 21 Sep 2023 16:04 UTC
2 points
1 ∶ 3
in reply to: Davidmanheim’s comment on: AI Pause Will Likely Backfire
> And you’ve already agreed that it’s implausible that these efforts would lead to tyranny, you think they will just fail.
I think that conditional on the efforts working, the chance of tyranny is quite high (ballpark 30-40%). I don’t think they’ll work, but if they do, it seems quite bad.
And since I think x-risk from technical AI alignment failure is in the 1-2% range, the risk of tyranny is the dominant effect of “actually enforced global AI pause” in my EV calculation, followed by the extra fast takeoff risks, and then followed by “maybe we get net positive alignment research.”
What links here?
- How could a moratorium fail? by Davidmanheim (22 Sep 2023 15:11 UTC; 49 points)

Nora Belrose 19 Sep 2023 15:36 UTC
3 points
0 ∶ 0
in reply to: Davidmanheim’s comment on: AI Pause Will Likely Backfire
I have now made a clarification at the very top of the post to make it 1000% clear that my opposition is disjunctive, because people repeatedly get confused / misunderstand me on this point.

Nora Belrose 19 Sep 2023 15:32 UTC
5 points
3 ∶ 0
in reply to: RobertM’s comment on: AI Pause Will Likely Backfire
Please stop saying that mind-space is an “enormously broad space.” What does that even mean? How have you established a measure on mind-space that isn’t totally arbitrary?
What if concepts and values are convergent when trained on similar data, just like we see convergent evolution in biology?

Nora Belrose 19 Sep 2023 15:30 UTC
2 points
1 ∶ 4
in reply to: Zach Stein-Perlman’s comment on: The possibility of an indefinite AI pause
I think this post is best combined with my post. Together, these posts present a coherent, disjunctive set of arguments against pause.

Nora Belrose 19 Sep 2023 15:14 UTC
1 point
0 ∶ 0
in reply to: Davidmanheim’s comment on: AI Pause Will Likely Backfire
My opposition is disjunctive!
I both think that if it’s possible to stop the building of dangerously large models via international regulation, that would be bad because of tyranny risk, and I also think that we very likely can’t use international regulation to stop building these things, so that any local pauses are not going to have their intended effects and will have a lot of unintended net-negative effects.
> (Also, reread my piece—I call for action to regulate and stop larger and more dangerous models immediately as a prelude to a global moratorium. I didn’t say “wait a while, then impose a pause for a while in a few places.”)
This really sounds like you are committing the fallacy I was worried about earlier on. I just don’t agree that you will actually get the global moratorium. I am fully aware of what your position is.

Nora Belrose 18 Sep 2023 22:47 UTC
2 points
3 ∶ 0
in reply to: Davidmanheim’s comment on: AI Pause Will Likely Backfire
In my essay I don’t assume that the pause would be immediate, because I did read your essay and I saw that you were proposing that we’d need some time to prepare and get multiple countries on board.
I don’t see how a delay before a pause changes anything. I still think it’s highly unlikely you’re going to get sufficient international backing for the pause, so you will either end up doing a pause with an insufficiently large coalition, or you’ll back down and do no pause at all.

Nora Belrose 18 Sep 2023 3:03 UTC
3 points
1 ∶ 3
in reply to: Steven Byrnes’s comment on: AI Pause Will Likely Backfire
Differentiability is a pretty big part of the white box argument.
The terabyte compiled executable binary is still white box in a minimal sense but it’s going to take a lot of work to mould that thing into something that does what you want. You’ll have to decompile it and do a lot of static analysis, and Rice’s theorem gets in the way of the kinds of stuff you can prove about it. The code might be adversarially obfuscated, although literal black box obfuscation is provably impossible.
If instead of a terabyte of compiled code, you give me a trillion neural net weights, I can fine tune that network to do a lot of stuff. And if I’m worried about the base model being preserved underneath and doing nefarious things, I can generate synthetic data from the fine tuned model and train a fresh network from scratch on that (although to be fair that’s pretty compute-intensive).
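The "generate synthetic data from the fine-tuned model and train a fresh network from scratch" step can be illustrated with a toy sketch. Everything here is a hypothetical stand-in: `teacher` plays the role of the fine-tuned model, and the "fresh network" is a two-parameter linear model trained with plain gradient descent. It only shows the shape of the procedure (the student sees the teacher's outputs, never its weights), not anything about how it would behave at scale.

```python
import random

def teacher(x):
    # Hypothetical stand-in for the fine-tuned model: we only query its outputs.
    return 2.0 * x + 1.0

def make_synthetic_data(n, seed=0):
    # Sample inputs and label them with the teacher's outputs.
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, teacher(x)) for x in xs]

def train_student(data, lr=0.1, epochs=500):
    # Fresh initialization: no parameters are copied from the teacher.
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(epochs):
        # Gradients of mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

data = make_synthetic_data(200)
w, b = train_student(data)
print(f"student parameters: w={w:.3f}, b={b:.3f}")
```

The student ends up matching the teacher's input-output behavior on the sampled data while sharing none of its internal state, which is the point of the retraining step; the compute cost mentioned above comes from doing this with a full-size network instead of two parameters.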
What links here?
- Arguments for optimism on AI Alignment (I don’t endorse this version, will reupload a new version soon.) by Noosphere89 (LessWrong; 15 Oct 2023 14:51 UTC; 28 points)

Nora Belrose 17 Sep 2023 22:54 UTC
6 points
0 ∶ 1
in reply to: Zach Stein-Perlman’s comment on: AI Pause Will Likely Backfire
> It’s not obvious to me what alignment optimism has to do with the pause debate
Sorry, I thought it would be fairly obvious how it’s related. If you’re optimistic about alignment then the expected benefits you might hope to get out of a pause (whether or not you actually do get those benefits) are commensurately smaller, so the unintended consequences should have more relative weight in your EV calculation.
To be clear, I think slowing down AI in general, as opposed to the moratorium proposal in particular, is a more reasonable position that’s a bit harder to argue against. I do still think the overhang concerns apply in non-pause slowdowns but in a less acute manner.

Nora Belrose 17 Sep 2023 22:19 UTC
5 points
2 ∶ 1
in reply to: Steven Byrnes’s comment on: AI Pause Will Likely Backfire
It’s essentially no cost to run a gradient-based optimizer on a neural network, and I think this is sufficient for good-enough alignment. I view the interpretability work I do at Eleuther as icing on the cake, allowing us to steer models even more effectively than we already can. Yes, it’s not zero cost, but it’s dramatically lower cost than it would be if we had to crack open a skull and do neurosurgery.
Also, if by “mechanistic interpretability” you mean “circuits” I’m honestly pretty pessimistic about the usefulness of that kind of research, and I think the really-useful stuff is lower cost than circuits-based interp.
What links here?
- Arguments for optimism on AI Alignment (I don’t endorse this version, will reupload a new version soon.) by Noosphere89 (LessWrong; 15 Oct 2023 14:51 UTC; 28 points)

Nora Belrose 17 Sep 2023 22:10 UTC
4 points
2 ∶ 2
in reply to: Chris Leong’s comment on: AI Pause Will Likely Backfire
> That if there was a pause, alignment research would magically revert back to what it was back in the MIRI days
The claim is more like, “the MIRI days are a cautionary tale about what may happen when alignment research isn’t embedded inside a feedback loop with capabilities.” I don’t literally believe we would revert back to pure theoretical research during a pause, but I do think the research would get considerably lower quality.
> However, I’m worried that your [white box] framing is confusing and will cause people to talk past each other.
Perhaps, but I think the current conventional wisdom that neural nets are “black box” is itself a confusing and bad framing and I’m trying to displace it.

Nora Belrose 17 Sep 2023 22:03 UTC
11 points
6 ∶ 4
in reply to: DanielFilan’s comment on: AI Pause Will Likely Backfire
Yep I am aware of the value learning section of Chapter 12, which is why I used the “mostly” qualifier. That said he basically imagines something like Stuart Russell’s CIRL, rather than anything like LLMs or imitation learning.
If we treat the Orthogonality Thesis as the crux of the book, I also think the book has aged poorly. In fact it should have been obvious when the book was written that the Thesis is basically a motte-and-bailey where you argue for a super weak claim (any combo of intelligence and goals is logically possible), which is itself dubious IMO but easy to defend, and then pretend like you’ve proven something much stronger, like “intelligence and goals will be empirically uncorrelated in the systems we actually build” or something.

Nora Belrose 17 Sep 2023 21:53 UTC
7 points
2 ∶ 0
in reply to: Steven Byrnes’s comment on: AI Pause Will Likely Backfire
Yep, it’s all meant to be disjunctive, and yep, it could have been clearer. FWIW this essay went through multiple major revisions, and at one point I was trying to make the disjunctivity of it super clear, but then that got de-prioritized relative to other stuff. In the future, if/when I write about this, I think I’ll be able to organize things significantly better.
