This essay seems predicated on a few major assumptions that aren’t quite spelled out, or any rate not presented as assumptions.
Far from being “behind” capabilities, it seems that alignment research has made great strides in recent years. OpenAI and Anthropic showed that Reinforcement Learning from Human Feedback (RLHF) can be used to turn ungovernable large language models into helpful and harmless assistants. Scalable oversight techniques like Constitutional AI and model-written critiques show promise for aligning the very powerful models of the future. And just this week, it was shown that efficient instruction-following language models can be trained purely with synthetic text generated by a larger RLHF’d model, thereby removing unsafe or objectionable content from the training data and enabling far greater control.
This assumes that making AI behave nice is genuine progress in alignment. The opposing take is that all it’s doing is making the AI play a nicer character, but doesn’t lead it to internalize its goals, which is what alignment is actually about. And in fact, AI playing rude characters was never the problem to begin with.
You say that alignment is linked to capability in the essay, but this also seems predicated on the above. This kind of “alignment” makes the AI better at figuring out what the humans want, but historically, most thinkers in alignment have always assumed that AI gets good at figuring out what humans want, and that it’s dangerous anyway.
What worries me the most is that the primary reason for this view that’s presented in the essay seems to be a social one (or otherwise, I missed it).
We don’t need to speculate about what would happen to AI alignment research during a pause— we can look at the historical record. Before the launch of GPT-3 in 2020, the alignment community had nothing even remotely like a general intelligence to empirically study, and spent its time doing theoretical research, engaging in philosophical arguments on LessWrong, and occasionally performing toy experiments in reinforcement learning.
The Machine Intelligence Research Institute (MIRI), which was at the forefront of theoretical AI safety research during this period, has since admitted that its efforts have utterly failed. Stuart Russell’s “assistance game” research agenda, started in 2016, is now widely seen as mostly irrelevant to modern deep learning— see former student Rohin Shah’s review here, as well as Alex Turner’s comments here. The core argument of Nick Bostrom’s bestselling book Superintelligence has also aged quite poorly.[2]
At best, these theory-first efforts did very little to improve our understanding of how to align powerful AI. And they may have been net negative, insofar as they propagated a variety of actively misleading ways of thinking both among alignment researchers and the broader public. Some examples include the now-debunked analogy from evolution, the false distinction between “inner” and “outer” alignment, and the idea that AIs will be rigid utility maximizing consequentialists (here, here, and here).
During an AI pause, I expect alignment research would enter another “winter” in which progress stalls, and plausible-sounding-but-false speculations become entrenched as orthodoxy without empirical evidence to falsify them. [...]
I.e., Miri’s approach to alignment hasn’t worked out, therefore the current work is better. But this argument doesn’t work—but approaches can be failures! I think Eliezer would argue that Miri’s work had a chance of leading to an alignment solution but has failed, whereas current alignment work (like RLHF on LLMs) has no chance of solving alignment.
If this is true, then the core argument of this essay collapses, and I don’t see a strong argument here that it’s not true. Why should we believe that Miri is wrong about alignment difficulty? The fact that their approach failed is not strong evidence of this; if they’re right, then they weren’t very likely to succeed in the first place.
And even if they’re completely wrong, that still doesn’t prove that current alignment approaches have a good chance of working.
Another assumption you make is that AGI is close and, in particular, will come out of LLMs. E.g.:
Such international persuasion is even less plausible if we assume short, 3-10 year timelines. Public sentiment about AI varies widely across countries, and notably, China is among the most optimistic.
This is a case where you agree with most Miri staff but, e.g., Stuart Russel and Steven Byrnes are on record saying that we likely will not get AGI out of LLMs. If this is true, then RLHF done on LLMs is probably even less useful for alignment, and it also means the hard verdict on arguments in superintelligence is unwarranted. Things could still play out a lot more like classical AI alignment thinking in the paradigm that will actually give us AGI.
And I’m also not ready to toss out the inner vs. outer paradigm just because there was one post criticizing it.
The opposing take is that all it’s doing is making the AI play a nicer character, but doesn’t lead it to internalize its goals, which is what alignment is actually about.
I think this is a misleading frame which makes alignment seem harder than it actually is. What does it mean to “internalize” a goal? It’s something like, “you’ll keep pursuing the goal in new situations.” In other words, goal-internalization is a generalization problem.
We know a fair bit about how neural nets generalize, although we should study it more (I’m working on a paper on the topic atm). We know they favor “simple” functions, which means something like “low frequency” in the Fourier domain. In any case, I don’t see any reason to think the neural net prior is malign, or particularly biased toward deceptive, misaligned generalization. If anything the simplicity prior seems like good news for alignment.
It’s something like, “you’ll keep pursuing the goal in new situations.” In other words, goal-internalization is a generalization problem.
I think internalizing X means “pursuing X as a terminal goal”, whereas RLHF arguably only makes model pursue X as an instrumental goal (in which case the model would be deceptively aligned). I’m not saying that GPT-4 has a distinction between instrumental and terminal goals, but a future AGI, whether an LLM or not, could have terminal goals that are different from instrumental goals.
You might argue that deceptive alignment is also an obsolete paradigm, but I would again respond that we don’t know this, or at any rate, that the essay doesn’t make the argument.
I don’t think the terminal vs. instrumental goal dichotomy is very helpful, because it shifts the focus away from behavioral stuff we can actually measure (at least in principle). I also don’t think humans exhibit this distinction particularly strongly. I would prefer to talk about generalization, which is much more empirically testable and has a practical meaning.
What if it just is the case that AI will be dangerous for reasons that current systems don’t exhibit, and hence we don’t have empirical data on? If that’s the case, then limiting our concerns to only concepts that can be empirically tested seems like it means setting ourselves up for failure.
I’m not sure what one is supposed to do with a claim that can’t be empirically tested—do we just believe it/act as if it’s true forever? Wouldn’t this simply mean an unlimited pause in AI development (and why does this only apply to AI)?
In principle, we do the same thing as with any claim (whether explicitly or otherwise): - Estimate the expected value of (directly) testing the claim. - Test it if and only if (directly) testing it has positive EV.
The point here isn’t that the claim is special, or that AI is special—just that the EV calculation consistently comes out negative (unless someone else is about to do something even more dangerous—hence the need for coordination).
This is unusual and inconvenient. It appears to be the hand we’ve been dealt. I think you’re asking the right question: what is one supposed to do with a claim that can’t be empirically tested?
No deceptive or dangerous AI has ever been built or empirically tested. (1)
Historically AI capabilities have consistently been “underwhelming”, far below the hype. (2)
If we discuss “ok we build a large AGI, give it persistent memory and online learning, and isolate it in an air gapped data center and hand carry data to the machine via hardware locked media, what is the danger” you are going to respond either with:
“I don’t know how the model escapes but it’s so smart it will find a way” or (3)
“I am confident humanity will exist very far into the future so a small risk now is unacceptable (say 1-10 percent pDoom)”.
and if I point out that this large ASI model needs thousands of H100 accelerator cards and megawatts of power and specialized network topology to exist and there is nowhere to escape to, you will argue “it will optimize itself to fit on consumer PCs and escape to a botnet”. (4)
Have I summarized the arguments?
Like we’re supposed to coordinate an international pause and I see 4 unproven assertions above that have zero direct evidence. The one about humanity existing far into the future I don’t know I don’t want to argue that because it’s not falsifiable.
Thanks I mean more in terms of “how can we productively resolve our disagreements about this?”, which the EV calculations are downstream of. To be clear, it doesn’t seem to me that this is necessarily the hand we’ve been dealt but I’m not sure how to reduce the uncertainty.
At the risk of sidestepping the question, the obvious move seems to be “try harder to make the claim empirically testable”! For example, in the case of deception, which I think is a central example we could (not claiming these ideas are novel):
Test directly for deception behaviourally and/or mechanistically (I’m aware that people are doing this, think it’s good and wish the results were more broadly shared).
Think about what aspects of deception make it particularly hard, and try to study those in isolation and test those. The most important example seems to me to be precursors: finding more testable analogues to the question of “before we get good, undetectable deception do we get kind of crappy detectable deception?”
Obviously these all run some (imo substantially lower) risks but seem well worth doing. Before we declare the question empirically inaccessible we should at least do these and synthesise the results (for instance, what does grokking say about (2)?).
(I’m spinning this comment out because it’s pretty different in style and seems worth being able to reply to separately. Please let me know if this kind of chain-posting is frowned upon here.)
Another downside to declaring things empirically out of reach and relying on priors for your EV calculations and subsequent actions is that it more-or-less inevitably converts epistemic disagreements into conflict.
If it seems likely to you that this is the way things are (and so we should pause indefinitely) but it seems highly unlikely to me (and so we should not) then we have no choice but to just advocate for different things. There’s not even the prospect of having recourse to better evidence to win over third parties, so the conflict becomes no-holds-barred. I see this right now on Twitter and it makes me very sad. I think we can do better.
(apologies for slowness; I’m not here much) I’d say it’s more about being willing to update on less direct evidence when the risk of getting more direct evidence is high.
Clearly we should aim to get more evidence. The question is how to best do that safely. At present we seem to be taking the default path—of gathering evidence in about the easiest way, rather than going for something harder, slower and safer. (e.g. all the “we need to work with frontier models” stuff; I do expect that’s most efficient on the empirical side; I don’t expect it’s a safe approach)
For demonstration, we can certainly do useful empirical stuff—ARC Evals already did the lying-to-a-taskrabbit worker demonstration (clearly this isn’t anything like deceptive alignment, but it’s deception [given suitable scaffolding]).
I think that other demonstrations of this kind will be useful in the short term.
For avoiding all forms of deception, I’m much more pessimistic—since this requires us to have no blind-spots, and to address the problem in a fundamentally general way. (personally I doubt there’s a [general solution to all kinds of deception] without some pretty general alignment solution—though I may be wrong)
I’m sure we’ll come up with solutions to particular types of / definitions of deception in particular contexts. This doesn’t necessarily tell us much about other types of deception in other contexts. (for example, this kind of thing—but not only this kind of thing)
I’d also note that “reducing the uncertainty” is only progress when we’re correct. The problem that kills us isn’t uncertainty, but overconfidence. (though granted it might be someone else’s overconfidence)
You need to have some motivation for thinking that a fundamentally new kind of danger will emerge in future systems, in such a way that we won’t be able to handle it as it arises. Otherwise anyone can come up with any nonsense they like.
If you’re talking about e.g. Evan Hubinger’s arguments for deceptive alignment, I think those arguments are very bad, in light of 1) the white box argument I give in this post, 2) the incoherence of Evan’s notion of “mechanistic optimization,” and 3) his reliance on “counting arguments” where you’re supposed to assume that the “inner goals” of the AI are sampled “uniformly at random” from some uninformative prior over goals (I don’t think the LLM / deep learning prior is uninformative in this sense at all).
You need to have some motivation for thinking that a fundamentally new kind of danger will emerge in future systems, in such a way that we won’t be able to handle it as it arises.
That was what everyone ins AI safety was discussing for a decade or more, until around 2018. You seem to ignore these arguments about why AI will be dangerous, as well as all of the arguments that alignment will be hard. Are you familiar with all of that work?
I also don’t think humans exhibit this distinction [terminal vs. instrumental goal dichotomy] particularly strongly.
the ideological turing test seems like a case where the distinction can be seen clearly in humans; the instrumental goal is to persuade the other that you sincerely hold beliefs/values (which imply goals). while your terminal goal is to advance advocacy of your different actual beliefs.
In any case, I don’t see any reason to think the neural net prior is malign, or particularly biased toward deceptive, misaligned generalization. If anything the simplicity prior seems like good news for alignment.
I definitely disagree with this—especially the last sentence; essentially all of my hope for neural net inductive biases comes from them not being like an actual simplicity prior. The primary literature I’d reference here would be “How likely is deceptive alignment?” for the practical question regarding concrete neural net inductive biases and “The Solomonoff Prior is Malign” for the purely theoretical question concerning the actual simplicity prior.
So, I definitely don’t have the Solomonoff prior in mind when I talk about simplicity. I’m actively doing research at the moment to better characterize the sense in which neural nets are biased toward “simple” functions, but I would be shocked if it has anything to do with Kolmogorov complexity.
Okay, my crux is that the simplicity/Kolmogorov/Solomonoff prior is probably not very malign, assuming we could run it, and in general I find the prior not to be malign except for specific situations.
This is basically because it relies on the IMO dubious assumption that the halting oracle can only be used once, and notably once we use the halting/Solomonoff oracle more than once, the Solomonoff oracle loses it’s malign properties.
More generally, if the Solomonoff Oracle is duplicatable, as modern AIs generally are, then there’s a known solution to mitigate the malignancy of the Solomonoff prior: Duplicate it, and let multiple people run the Solomonoff inductor in parallel to increase the complexity of manipulation. The goal is essentially to remove the uniqueness of 1 Solomonoff inductor, and make an arbitrary number of such oracles to drive up the complexity of manipulation.
So under a weak assumption, the malignancy of the Solomonoff prior goes away.
This is described well in the link below, and the important part is that we need either a use-once condition, or we need to assume uniqueness in some way. If we don’t have either assumption holding, as is likely to be the case, then the Solomonoff/Kolmogorov prior isn’t malign.
More specifically, it’s this part of John Wentworth’s comment:
In Solomonoff Model, Sufficiently Large Data Rules Out Malignness
There is a major outside-view reason to expect that the Solomonoff-is-malign argument must be doing something fishy: Solomonoff Induction (SI) comes with performance guarantees. In the limit of large data, SI performs as well as the best-predicting program, in every computably-generated world. The post mentions that:
A simple application of the no free lunch theorem shows that there is no way of making predictions that is better than the Solomonoff prior across all possible distributions over all possible strings. Thus, agents that are influencing the Solomonoff prior cannot be good at predicting, and thus gain influence, in all possible worlds.
… but in the large-data limit, SI’s guarantees are stronger than just that. In the large-data limit, there is no computable way of making better predictions than the Solomonoff prior in any world. Thus, agents that are influencing the Solomonoff prior cannot gain long-term influence in any computable world; they have zero degrees of freedom to use for influence. It does not matter if they specialize in influencing worlds in which they have short strings; they still cannot use any degrees of freedom for influence without losing all their influence in the large-data limit.
Takeaway of this argument: as long as we throw enough data at our Solomonoff inductor before asking it for any outputs, the malign agent problem must go away. (Though note that we never know exactly how much data that is; all we have is a big-O argument with an uncomputable constant.)
As far as the actual practical question, there is a very important limitation on inner-misaligned agents by SGD, primarily because gradient hacking is very difficult to do, and is an underappreciated limitation on misalignment, since SGD has powerful tools to remove inner-misaligned circuits/TMs/Agents in the link below:
That progress is incredibly fast, and new architectures explicitly aimed at creating AGI are getting proposed and implemented. (I’m agnostic about whether LLMs will scale past human reasoning—it seems very plausible they won’t. But I don’t think it matters, because that’s not the only research direction with tons of resources being put into it that create existential risks.)
Interesting—what do you have in mind for fast-progressing architectures explicitly aimed at creating AGI?
On your 2nd point on x-risks from non-LLM AI, am I right in thinking that you would also hope to catch dual-use scientific AI (for instance) in a compute governance scheme and/or pause? That’s a considerably broader remit than I’ve seen advocates of a pause/compute restrictions argue for and seems much harder to achieve both politically and technically.
If regulators or model review firms have any flexibility (which seems very plausible,) and the danger of AGI is recognized (which seems increasingly likely,) once there is any recognition of promising progress towards AGI, review of the models for safety would occur—as it should, as in any other engineering discipline, albeit in this case more like civil engineering, where lives are on the line, than software engineering, where they usually aren’t.
And considering other risks, as I argued in my piece, there’s an existing requirement for countries to ban bioweapons development, again, as there should be. I’m simply proposing that countries should fulfill that obligation, in this case, by requiring review of potentially dangerous research into ML which can be applied to certain classes of virology.
This essay seems predicated on a few major assumptions that aren’t quite spelled out, or any rate not presented as assumptions.
This assumes that making AI behave nice is genuine progress in alignment. The opposing take is that all it’s doing is making the AI play a nicer character, but doesn’t lead it to internalize its goals, which is what alignment is actually about. And in fact, AI playing rude characters was never the problem to begin with.
You say that alignment is linked to capability in the essay, but this also seems predicated on the above. This kind of “alignment” makes the AI better at figuring out what the humans want, but historically, most thinkers in alignment have always assumed that AI gets good at figuring out what humans want, and that it’s dangerous anyway.
What worries me the most is that the primary reason for this view that’s presented in the essay seems to be a social one (or otherwise, I missed it).
I.e., Miri’s approach to alignment hasn’t worked out, therefore the current work is better. But this argument doesn’t work—but approaches can be failures! I think Eliezer would argue that Miri’s work had a chance of leading to an alignment solution but has failed, whereas current alignment work (like RLHF on LLMs) has no chance of solving alignment.
If this is true, then the core argument of this essay collapses, and I don’t see a strong argument here that it’s not true. Why should we believe that Miri is wrong about alignment difficulty? The fact that their approach failed is not strong evidence of this; if they’re right, then they weren’t very likely to succeed in the first place.
And even if they’re completely wrong, that still doesn’t prove that current alignment approaches have a good chance of working.
Another assumption you make is that AGI is close and, in particular, will come out of LLMs. E.g.:
This is a case where you agree with most Miri staff but, e.g., Stuart Russel and Steven Byrnes are on record saying that we likely will not get AGI out of LLMs. If this is true, then RLHF done on LLMs is probably even less useful for alignment, and it also means the hard verdict on arguments in superintelligence is unwarranted. Things could still play out a lot more like classical AI alignment thinking in the paradigm that will actually give us AGI.
And I’m also not ready to toss out the inner vs. outer paradigm just because there was one post criticizing it.
I think this is a misleading frame which makes alignment seem harder than it actually is. What does it mean to “internalize” a goal? It’s something like, “you’ll keep pursuing the goal in new situations.” In other words, goal-internalization is a generalization problem.
We know a fair bit about how neural nets generalize, although we should study it more (I’m working on a paper on the topic atm). We know they favor “simple” functions, which means something like “low frequency” in the Fourier domain. In any case, I don’t see any reason to think the neural net prior is malign, or particularly biased toward deceptive, misaligned generalization. If anything the simplicity prior seems like good news for alignment.
I think internalizing X means “pursuing X as a terminal goal”, whereas RLHF arguably only makes model pursue X as an instrumental goal (in which case the model would be deceptively aligned). I’m not saying that GPT-4 has a distinction between instrumental and terminal goals, but a future AGI, whether an LLM or not, could have terminal goals that are different from instrumental goals.
You might argue that deceptive alignment is also an obsolete paradigm, but I would again respond that we don’t know this, or at any rate, that the essay doesn’t make the argument.
I don’t think the terminal vs. instrumental goal dichotomy is very helpful, because it shifts the focus away from behavioral stuff we can actually measure (at least in principle). I also don’t think humans exhibit this distinction particularly strongly. I would prefer to talk about generalization, which is much more empirically testable and has a practical meaning.
What if it just is the case that AI will be dangerous for reasons that current systems don’t exhibit, and hence we don’t have empirical data on? If that’s the case, then limiting our concerns to only concepts that can be empirically tested seems like it means setting ourselves up for failure.
I’m not sure what one is supposed to do with a claim that can’t be empirically tested—do we just believe it/act as if it’s true forever? Wouldn’t this simply mean an unlimited pause in AI development (and why does this only apply to AI)?
In principle, we do the same thing as with any claim (whether explicitly or otherwise):
- Estimate the expected value of (directly) testing the claim.
- Test it if and only if (directly) testing it has positive EV.
The point here isn’t that the claim is special, or that AI is special—just that the EV calculation consistently comes out negative (unless someone else is about to do something even more dangerous—hence the need for coordination).
This is unusual and inconvenient. It appears to be the hand we’ve been dealt.
I think you’re asking the right question: what is one supposed to do with a claim that can’t be empirically tested?
So just to summarize:
No deceptive or dangerous AI has ever been built or empirically tested. (1)
Historically AI capabilities have consistently been “underwhelming”, far below the hype. (2)
If we discuss “ok we build a large AGI, give it persistent memory and online learning, and isolate it in an air gapped data center and hand carry data to the machine via hardware locked media, what is the danger” you are going to respond either with:
“I don’t know how the model escapes but it’s so smart it will find a way” or (3)
“I am confident humanity will exist very far into the future so a small risk now is unacceptable (say 1-10 percent pDoom)”.
and if I point out that this large ASI model needs thousands of H100 accelerator cards and megawatts of power and specialized network topology to exist and there is nowhere to escape to, you will argue “it will optimize itself to fit on consumer PCs and escape to a botnet”. (4)
Have I summarized the arguments?
Like we’re supposed to coordinate an international pause and I see 4 unproven assertions above that have zero direct evidence. The one about humanity existing far into the future I don’t know I don’t want to argue that because it’s not falsifiable.
Shouldn’t we wait for evidence?
Thanks I mean more in terms of “how can we productively resolve our disagreements about this?”, which the EV calculations are downstream of. To be clear, it doesn’t seem to me that this is necessarily the hand we’ve been dealt but I’m not sure how to reduce the uncertainty.
At the risk of sidestepping the question, the obvious move seems to be “try harder to make the claim empirically testable”! For example, in the case of deception, which I think is a central example we could (not claiming these ideas are novel):
Test directly for deception behaviourally and/or mechanistically (I’m aware that people are doing this, think it’s good and wish the results were more broadly shared).
Think about what aspects of deception make it particularly hard, and try to study those in isolation and test those. The most important example seems to me to be precursors: finding more testable analogues to the question of “before we get good, undetectable deception do we get kind of crappy detectable deception?”
Obviously these all run some (imo substantially lower) risks but seem well worth doing. Before we declare the question empirically inaccessible we should at least do these and synthesise the results (for instance, what does grokking say about (2)?).
(I’m spinning this comment out because it’s pretty different in style and seems worth being able to reply to separately. Please let me know if this kind of chain-posting is frowned upon here.)
Another downside to declaring things empirically out of reach and relying on priors for your EV calculations and subsequent actions is that it more-or-less inevitably converts epistemic disagreements into conflict.
If it seems likely to you that this is the way things are (and so we should pause indefinitely) but it seems highly unlikely to me (and so we should not) then we have no choice but to just advocate for different things. There’s not even the prospect of having recourse to better evidence to win over third parties, so the conflict becomes no-holds-barred. I see this right now on Twitter and it makes me very sad. I think we can do better.
(apologies for slowness; I’m not here much)
I’d say it’s more about being willing to update on less direct evidence when the risk of getting more direct evidence is high.
Clearly we should aim to get more evidence. The question is how to best do that safely. At present we seem to be taking the default path—of gathering evidence in about the easiest way, rather than going for something harder, slower and safer. (e.g. all the “we need to work with frontier models” stuff; I do expect that’s most efficient on the empirical side; I don’t expect it’s a safe approach)
I think a lot depends on whether we’re:
Aiming to demonstrate that deception can happen.
Aiming to robustly avoid deception.
For demonstration, we can certainly do useful empirical stuff—ARC Evals already did the lying-to-a-taskrabbit worker demonstration (clearly this isn’t anything like deceptive alignment, but it’s deception [given suitable scaffolding]).
I think that other demonstrations of this kind will be useful in the short term.
For avoiding all forms of deception, I’m much more pessimistic—since this requires us to have no blind-spots, and to address the problem in a fundamentally general way. (personally I doubt there’s a [general solution to all kinds of deception] without some pretty general alignment solution—though I may be wrong)
I’m sure we’ll come up with solutions to particular types of / definitions of deception in particular contexts. This doesn’t necessarily tell us much about other types of deception in other contexts. (for example, this kind of thing—but not only this kind of thing)
I’d also note that “reducing the uncertainty” is only progress when we’re correct. The problem that kills us isn’t uncertainty, but overconfidence. (though granted it might be someone else’s overconfidence)
You need to have some motivation for thinking that a fundamentally new kind of danger will emerge in future systems, in such a way that we won’t be able to handle it as it arises. Otherwise anyone can come up with any nonsense they like.
If you’re talking about e.g. Evan Hubinger’s arguments for deceptive alignment, I think those arguments are very bad, in light of 1) the white box argument I give in this post, 2) the incoherence of Evan’s notion of “mechanistic optimization,” and 3) his reliance on “counting arguments” where you’re supposed to assume that the “inner goals” of the AI are sampled “uniformly at random” from some uninformative prior over goals (I don’t think the LLM / deep learning prior is uninformative in this sense at all).
That was what everyone ins AI safety was discussing for a decade or more, until around 2018. You seem to ignore these arguments about why AI will be dangerous, as well as all of the arguments that alignment will be hard. Are you familiar with all of that work?
the ideological turing test seems like a case where the distinction can be seen clearly in humans; the instrumental goal is to persuade the other that you sincerely hold beliefs/values (which imply goals). while your terminal goal is to advance advocacy of your different actual beliefs.
I definitely disagree with this—especially the last sentence; essentially all of my hope for neural net inductive biases comes from them not being like an actual simplicity prior. The primary literature I’d reference here would be “How likely is deceptive alignment?” for the practical question regarding concrete neural net inductive biases and “The Solomonoff Prior is Malign” for the purely theoretical question concerning the actual simplicity prior.
So, I definitely don’t have the Solomonoff prior in mind when I talk about simplicity. I’m actively doing research at the moment to better characterize the sense in which neural nets are biased toward “simple” functions, but I would be shocked if it has anything to do with Kolmogorov complexity.
Okay, my crux is that the simplicity/Kolmogorov/Solomonoff prior is probably not very malign, assuming we could run it, and in general I find the prior not to be malign except for specific situations.
This is basically because it relies on the IMO dubious assumption that the halting oracle can only be used once, and notably once we use the halting/Solomonoff oracle more than once, the Solomonoff oracle loses it’s malign properties.
More generally, if the Solomonoff Oracle is duplicatable, as modern AIs generally are, then there’s a known solution to mitigate the malignancy of the Solomonoff prior: Duplicate it, and let multiple people run the Solomonoff inductor in parallel to increase the complexity of manipulation. The goal is essentially to remove the uniqueness of 1 Solomonoff inductor, and make an arbitrary number of such oracles to drive up the complexity of manipulation.
So under a weak assumption, the malignancy of the Solomonoff prior goes away. This is described well in the link below, and the important part is that we need either a use-once condition, or we need to assume uniqueness in some way. If we don’t have either assumption holding, as is likely to be the case, then the Solomonoff/Kolmogorov prior isn’t malign.
https://www.lesswrong.com/posts/f7qcAS4DMKsMoxTmK/the-solomonoff-prior-is-malign-it-s-not-a-big-deal#Comparison_
And that’s if it’s actually malign, which it might not be, at least in the large-data limit:
https://www.lesswrong.com/posts/Tr7tAyt5zZpdTwTQK/the-solomonoff-prior-is-malign#fDEmEHEx5EuET4FBF
More specifically, it’s this part of John Wentworth’s comment:
As far as the actual practical question, there is a very important limitation on inner-misaligned agents by SGD, primarily because gradient hacking is very difficult to do, and is an underappreciated limitation on misalignment, since SGD has powerful tools to remove inner-misaligned circuits/TMs/Agents in the link below:
https://www.lesswrong.com/posts/w2TAEvME2yAG9MHeq/gradient-hacking-is-extremely-difficult
On the last part of your comment—if AGI doesn’t come out of LLMs then what would the justification for a pause be?
That progress is incredibly fast, and new architectures explicitly aimed at creating AGI are getting proposed and implemented. (I’m agnostic about whether LLMs will scale past human reasoning—it seems very plausible they won’t. But I don’t think it matters, because that’s not the only research direction with tons of resources being put into it that create existential risks.)
Interesting—what do you have in mind for fast-progressing architectures explicitly aimed at creating AGI?
On your 2nd point on x-risks from non-LLM AI, am I right in thinking that you would also hope to catch dual-use scientific AI (for instance) in a compute governance scheme and/or pause? That’s a considerably broader remit than I’ve seen advocates of a pause/compute restrictions argue for and seems much harder to achieve both politically and technically.
If regulators or model review firms have any flexibility (which seems very plausible,) and the danger of AGI is recognized (which seems increasingly likely,) once there is any recognition of promising progress towards AGI, review of the models for safety would occur—as it should, as in any other engineering discipline, albeit in this case more like civil engineering, where lives are on the line, than software engineering, where they usually aren’t.
And considering other risks, as I argued in my piece, there’s an existing requirement for countries to ban bioweapons development, again, as there should be. I’m simply proposing that countries should fulfill that obligation, in this case, by requiring review of potentially dangerous research into ML which can be applied to certain classes of virology.