Unfortunately, this post got published under the wrong username. I’m the Nora who wrote this post. I hope it can be fixed soon.
Nora Belrose
AI Pause Will Likely Backfire
Yep, I was also hoping the images could be text-wrapped, but idk if this platform supports that.
The opposing take is that all it’s doing is making the AI play a nicer character, but doesn’t lead it to internalize its goals, which is what alignment is actually about.
I think this is a misleading frame which makes alignment seem harder than it actually is. What does it mean to “internalize” a goal? It’s something like, “you’ll keep pursuing the goal in new situations.” In other words, goal-internalization is a generalization problem.
We know a fair bit about how neural nets generalize, although we should study it more (I’m working on a paper on the topic atm). We know they favor “simple” functions, which means something like “low frequency” in the Fourier domain. In any case, I don’t see any reason to think the neural net prior is malign, or particularly biased toward deceptive, misaligned generalization. If anything the simplicity prior seems like good news for alignment.
I don’t think the terminal vs. instrumental goal dichotomy is very helpful, because it shifts the focus away from behavioral stuff we can actually measure (at least in principle). I also don’t think humans exhibit this distinction particularly strongly. I would prefer to talk about generalization, which is much more empirically testable and has a practical meaning.
Where we agree:
“dangerous-capability-model-eval-based regulation” sounds good to me. I’m also in favor of Robin Hanson’s foom liability proposal. These seem like very targeted measures that would plausibly reduce the tail risk of existential catastrophe, and don’t have many negative side effects. I’m also not opposed to the US trying to slow down other states, although it’d depend on the specifics of the proposal.
Where we (partially) disagree:
I think there’s a plausible case to be made that publishing model weights reduces foom risk by making AI capabilities more broadly distributed, and also enhances security-by-transparency. Of course there are concerns about misuse (I do think that’s a real thing to be worried about), but I also think it’s generally exaggerated. I also relatively strongly favor open source on purely normative grounds. So my inclination is to be in favor of it but with reservations. Same goes for labs publishing capabilities research.
Why does it have to be one or the other? I personally don’t put much stock in what Eliezer and Nate think, but many other people do.
the “pause” would be a temporary measure imposed by some countries, as opposed to a stop-gap solution and regulation imposed to enable stronger international regulation, which Nora says she supports
I don’t understand the distinction you’re trying to make between these two things. They really seem like the same thing to me, because a stop-gap measure is temporary by definition:
If by “stronger international regulation” you mean “global AI pause,” I argue explicitly that such a global pause is highly unlikely to happen. You don’t get to assume that your proposed “stop-gap” pause will in fact lead to a global pause just because you called it a stop-gap. What if it doesn’t? Will it be worse than no pause at all in that scenario? That’s a big part of what we’re debating. Is it a “straw man” if I just disagree with you about the likely effects of the policies you’re proposing?

I’m also against a global pause even if we can make it happen, and I say so in the post:
If in spite of all this, we somehow manage to establish a global AI moratorium, I think we should be quite worried that the global government needed to enforce such a ban would greatly increase the risk of permanent tyranny, itself an existential catastrophe. I don’t have time to discuss the issue here, but I recommend reading Matthew Barnett’s “The possibility of an indefinite AI pause” and Quintin Pope’s “AI is centralizing by default; let’s not make it worse,” both submissions to this debate.
You need to have some motivation for thinking that a fundamentally new kind of danger will emerge in future systems, in such a way that we won’t be able to handle it as it arises. Otherwise anyone can come up with any nonsense they like.
If you’re talking about e.g. Evan Hubinger’s arguments for deceptive alignment, I think those arguments are very bad, in light of 1) the white box argument I give in this post, 2) the incoherence of Evan’s notion of “mechanistic optimization,” and 3) his reliance on “counting arguments” where you’re supposed to assume that the “inner goals” of the AI are sampled “uniformly at random” from some uninformative prior over goals (I don’t think the LLM / deep learning prior is uninformative in this sense at all).
Yep, it’s all meant to be disjunctive, and yep, it could have been clearer. FWIW this essay went through multiple major revisions, and at one point I was trying to make the disjunctivity of it super clear, but then that got de-prioritized relative to other stuff. In the future, if/when I write about this again, I think I’ll be able to organize things significantly better.
Yep I am aware of the value learning section of Chapter 12, which is why I used the “mostly” qualifier. That said he basically imagines something like Stuart Russell’s CIRL, rather than anything like LLMs or imitation learning.
If we treat the Orthogonality Thesis as the crux of the book, I also think the book has aged poorly. In fact it should have been obvious when the book was written that the Thesis is basically a motte-and-bailey where you argue for a super weak claim (any combo of intelligence and goals is logically possible), which is itself dubious IMO but easy to defend, and then pretend like you’ve proven something much stronger, like “intelligence and goals will be empirically uncorrelated in the systems we actually build” or something.
That if there was a pause, alignment research would magically revert back to what it was back in the MIRI days
The claim is more like, “the MIRI days are a cautionary tale about what may happen when alignment research isn’t embedded inside a feedback loop with capabilities.” I don’t literally believe we would revert back to pure theoretical research during a pause, but I do think the research would get considerably lower quality.
However, I’m worried that your [white box] framing is confusing and will cause people to talk past each other.
Perhaps, but I think the current conventional wisdom that neural nets are “black box” is itself a confusing and bad framing and I’m trying to displace it.
It’s essentially no cost to run a gradient-based optimizer on a neural network, and I think this is sufficient for good-enough alignment. I view the interpretability work I do at Eleuther as icing on the cake, allowing us to steer models even more effectively than we already can. Yes, it’s not zero cost, but it’s dramatically lower cost than it would be if we had to crack open a skull and do neurosurgery.
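As a toy illustration of how cheap white-box, gradient-based steering is (a minimal sketch of the general idea, not EleutherAI's actual methods): with full access to the weights, exact gradients are available analytically, and a handful of cheap update steps drive the model's output to a desired target.

```python
import numpy as np

# Toy "model": a linear map y = W @ x. White-box access to W lets us
# compute exact gradients and steer the output toward a target cheaply.
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3))
x = np.array([1.0, -0.5, 2.0])
target = np.array([1.0, 1.0])  # the behavior we want to steer toward

def loss(W):
    err = W @ x - target
    return float(err @ err)

initial = loss(W)
lr = 0.05
for _ in range(100):
    err = W @ x - target
    grad = 2.0 * np.outer(err, x)  # exact gradient of the squared error
    W -= lr * grad                 # one cheap white-box update step
final = loss(W)
print(initial, final)
```

Contrast this with a true black box, where the only option would be expensive query-based search over inputs with no gradient signal at all.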
Also, if by “mechanistic interpretability” you mean “circuits” I’m honestly pretty pessimistic about the usefulness of that kind of research, and I think the really-useful stuff is lower cost than circuits-based interp.
It’s not obvious to me what alignment optimism has to do with the pause debate
Sorry, I thought it would be fairly obvious how it’s related. If you’re optimistic about alignment then the expected benefits you might hope to get out of a pause (whether or not you actually do get those benefits) are commensurately smaller, so the unintended consequences should have more relative weight in your EV calculation.
To be clear, I think slowing down AI in general, as opposed to the moratorium proposal in particular, is a more reasonable position that’s a bit harder to argue against. I do still think the overhang concerns apply in non-pause slowdowns but in a less acute manner.
Differentiability is a pretty big part of the white box argument.
The terabyte compiled executable binary is still white box in a minimal sense but it’s going to take a lot of work to mould that thing into something that does what you want. You’ll have to decompile it and do a lot of static analysis, and Rice’s theorem gets in the way of the kinds of stuff you can prove about it. The code might be adversarially obfuscated, although literal black box obfuscation is provably impossible.
If instead of a terabyte of compiled code, you give me a trillion neural net weights, I can fine tune that network to do a lot of stuff. And if I’m worried about the base model being preserved underneath and doing nefarious things, I can generate synthetic data from the fine tuned model and train a fresh network from scratch on that (although to be fair that’s pretty compute-intensive).
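The retrain-from-synthetic-data idea above can be sketched schematically (illustrative only; with numpy stand-ins for the models, since actually distilling a trillion-parameter network is, as noted, compute-intensive): sample inputs, label them with the fine-tuned model, and fit a fresh model from scratch on those pairs, carrying over no weights from the original base model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the fine-tuned model: some fixed input-output behavior.
def finetuned_model(x):
    return np.sin(x) + 0.1 * x

# Step 1: generate synthetic data by querying the fine-tuned model.
x_syn = rng.uniform(-3, 3, size=500)
y_syn = finetuned_model(x_syn)

# Step 2: train a fresh model from scratch on the synthetic pairs.
# Here the "fresh model" is a polynomial least-squares fit; any learner works.
coeffs = np.polyfit(x_syn, y_syn, deg=7)
fresh_model = np.poly1d(coeffs)

# The fresh model reproduces the fine-tuned behavior, with nothing
# inherited from the original base model's weights.
x_eval = np.linspace(-3, 3, 200)
max_gap = np.max(np.abs(fresh_model(x_eval) - finetuned_model(x_eval)))
print(max_gap)
```

The point of the sketch is structural: the fresh model is trained only on the fine-tuned model's input-output behavior, so any "nefarious base model preserved underneath" has no channel into it except through that behavior.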
In my essay I don’t assume that the pause would be immediate, because I did read your essay and I saw that you were proposing that we’d need some time to prepare and get multiple countries on board.
I don’t see how a delay before a pause changes anything. I still think it’s highly unlikely you’re going to get sufficient international backing for the pause, so you will either end up doing a pause with an insufficiently large coalition, or you’ll back down and do no pause at all.
My opposition is disjunctive!
I think that if it’s possible to stop the building of dangerously large models via international regulation, doing so would be bad because of the tyranny risk; and I also think that we very likely can’t use international regulation to stop building these things, so that any local pauses are not going to have their intended effects and will have a lot of unintended net-negative effects.
(Also, reread my piece—I call for action to regulate and stop larger and more dangerous models immediately as a prelude to a global moratorium. I didn’t say “wait a while, then impose a pause for a while in a few places.”)
This really sounds like you are committing the fallacy I was worried about earlier on. I just don’t agree that you will actually get the global moratorium. I am fully aware of what your position is.
I think this post is best combined with my post. Together, these posts present a coherent, disjunctive set of arguments against pause.
Please stop saying that mind-space is an “enormously broad space.” What does that even mean? How have you established a measure on mind-space that isn’t totally arbitrary?
What if concepts and values are convergent when trained on similar data, just like we see convergent evolution in biology?
This is a really interesting point that reminds me of arguments made by pragmatist philosophers like John Dewey and Richard Rorty. They also wanted to make “justification” an intersubjective phenomenon, of justifying your beliefs to other people. I don’t think they had money-pump arguments in mind though.