I take this post to argue that, just as an AGI's alignment property won't generalise well out-of-distribution, its ability to actually do things, i.e. achieve its goals, also won't generalise well out-of-distribution. Does that seem like a fair (if brief) summary?
As an aside, I feel like it's more fruitful to talk about specific classes of defects rather than all of them together. You use the word 'bug' to mean everything from divide-by-zero crashes to wrong beliefs, which leads you to write things like 'the inherent bugginess of AI is a very good thing for AI safety', whereas the entire field of AI safety seems to exist precisely because AIs will have bugs (i.e. deviations from desired/correct behaviour); so if anything an inherent lack of bugs in AI would be better for AI safety.
Yes, that's a fair summary. I think that perfect alignment is pretty much impossible, as is perfectly rational/bug-free AI. I think the latter fact may give us enough breathing room to get alignment at least good enough to avert extinction.
I feel like it's more fruitful to talk about specific classes of defects rather than all of them together. You use the word 'bug' to mean everything from divide-by-zero crashes to wrong beliefs
That's fair; I think if people were to further explore this topic it would make sense to separate them out. And good point about the bugginess passage, I've edited it to be more accurate.
The belief is that as soon as we create an AI with at least human-level general intelligence, it will be relatively easy for it to use its superior reasoning, extensive knowledge, and superhuman thinking speed to take over the world.
This depends on what 'human-level' means. There is some threshold such that an AI past that threshold could quickly take over the world, and it doesn't really matter whether we call that 'human-level' or not.
overall it seems like 'make AI stupid' is a far easier task than 'make the AI's goals perfectly aligned'.
Sure. But the relevant task isn't 'make something that won't kill you'. It's more like 'make something that will stop any AI from killing you', or maybe 'find a way to do alignment without much cost and without sacrificing much usefulness'. If you and I make stupid AI, great, but some lab will realize that non-stupid AI could be more useful, and will make it by default.
This is very true. However, the OP's point still helps us, as an AI that is simultaneously smart enough to be useful in a narrow domain, misaligned, but also too stupid to take over the world could help us reduce x-risk. In particular, if it is superhumanly good at alignment research, then it could output good alignment research as part of its deception phase. This would help reduce the risk from future AIs significantly without causing x-risk since, ex hypothesi, the AI is too stupid to take over. The main question here is whether an AI could be smart enough to do very good alignment research and also too stupid to take over the world if it tried. I am skeptical but pretty uncertain, so I would give it at least a 10% chance of being true, and maybe higher.
This depends on what 'human-level' means. There is some threshold such that an AI past that threshold could quickly take over the world, and it doesn't really matter whether we call that 'human-level' or not.
Indeed, this post is not an attempt to argue that AGI could never be a threat, merely that the 'threshold for subjugation' is much higher than 'any AGI', as many people imply. Human-level is just a marker for a level of intelligence that most people will agree counts as AGI, but which (due to mental flaws) is most likely not capable of world domination. For example, I do not believe an AI brain upload of Bobby Fischer could take over the world.
This makes a difference, because it means that the world in which the actual x-risk AGI comes into being is one in which a lot of earlier, non-deadly AGIs already exist and can be studied, or used against the rogue.
Sure. But the relevant task isn't 'make something that won't kill you'. It's more like 'make something that will stop any AI from killing you', or maybe 'find a way to do alignment without much cost and without sacrificing much usefulness'. If you and I make stupid AI, great, but some lab will realize that non-stupid AI could be more useful, and will make it by default.
Current narrow machine learning AI is extraordinarily stupid at things it isn't trained for, and yet it is still massively funded and incredibly powerful. Nobody is hankering to put a detailed understanding of quantum mechanics into DALL-E. A 'stupidity about world domination' module, focused on a few key dangerous areas like biochemistry, could potentially be implemented in most AIs without affecting performance at all. It wouldn't solve the problem entirely, but it would help mitigate risk.
Alternatively, if you want to 'make something that will stop AI from killing us' (presumably an AGI), you need to make sure that it can't kill us instead, and that could also be helped by deliberate flaws and ignorance. So make it an idiot savant at terminating AIs, but not at other things.
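To make the 'stupidity about world domination' module a bit more concrete, here is a minimal sketch of one crude way such deliberate ignorance could be approximated: screening flagged domains out of a training corpus before the model ever sees them. This is purely illustrative; HAZARD_KEYWORDS, is_hazardous, and filter_corpus are hypothetical placeholder names rather than any real library's API, and a real filter would need something far more robust than keyword matching.

```python
# Hypothetical sketch: drop documents touching flagged "dangerous domains"
# from a training corpus, so the resulting model stays deliberately weak there.

HAZARD_KEYWORDS = {
    "biochemistry": ["pathogen synthesis", "toxin production"],
    "cyberoffense": ["zero-day exploit", "privilege escalation"],
}

def is_hazardous(text: str) -> bool:
    """Crude keyword screen; a real system would use a trained classifier."""
    lowered = text.lower()
    return any(
        keyword in lowered
        for keywords in HAZARD_KEYWORDS.values()
        for keyword in keywords
    )

def filter_corpus(documents: list[str]) -> list[str]:
    """Keep only documents that avoid the flagged domains."""
    return [doc for doc in documents if not is_hazardous(doc)]

# Usage: train only on the filtered corpus, leaving the model ignorant
# in the flagged areas while its performance elsewhere is untouched.
corpus = [
    "Lab notes on toxin production in bacteria...",
    "A recipe for sourdough bread...",
]
safe_corpus = filter_corpus(corpus)  # keeps only the sourdough document
```

Whether anything like this could actually stop a capable general system from re-deriving the filtered knowledge is, of course, exactly the point under dispute in this thread.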
Buy the argument or don't, but this is a straw man.
Yeah, the first version will be a buggy mess, but the argument is that the first version that runs well enough to do anything will be debugged enough to be a threat. The mistake here is to claim that the 'first AGI' is going to be the final version; that's not what happens with software, and iteration, even if it's spread over a couple of years, is far faster than our realization of a potential problem. And the claim is that things will start going wrong only after enough bugs have been worked out, and by then it will be too late.
So, I think there is a threshold of intelligence and bug-free-ness (which I'll just call rationality) that will allow an AI to escape and attempt to attack humanity.
I also think there is a threshold of intelligence and rationality that could allow an AI to actually succeed in subjugating us all.
I believe that the second threshold is much, much higher than the first, and we would expect to see huge numbers of AI versions that pass the first threshold but not the second. If pre-alpha builds are intelligent enough to escape, they will be the first builds to attack.
Even if we're looking at released builds, though, those builds will only be debugged within specific domains. Nobody is going to debug the geopolitical abilities of an AI designed to build paperclips. So the fact that debugging occurs in one domain is no guarantee of success in any other.
Note: The below is all speculative. I'm much more interested in pushing back against your seeming confidence in your model than in saying I'm confident in the opposite. In fact, I think there are ways to avoid many of the failure modes, which safety researchers are pioneering now; I just don't think we should be at all confident they work, and we should be near-certain they won't happen by default.
That said, I don't agree that it's obvious that the two thresholds you mention are far apart on the relevant scale, though how exactly to construct the relevant scale is unclear. And even if they are far apart, there are reasons to worry.
The first point, that the window is likely narrow, is because near-human capability has turned out to be a very narrow band in many or most of the domains where ML has been successful. For example, moving from 'beat some good Go players' to 'unambiguously better than the best living players' took a few months.
The second point is that I think the jump from 'around human competence' to 'smarter than most/all humans' is plausibly closely related both to how much power we will end up giving systems and (partly as a consequence) to how likely they are to end up actually trying to attack in some non-trivial way. This point is based on my intuitive understanding of why very few humans attempt anything that will get them jailed: even psychopaths who don't actually care about the harm being caused wait until they are likely to get away with something. Lastly and relatedly, once humans reach a certain educational level, you don't need to explicitly train people to reason in specific domains; they find books and build inter-domain knowledge on their own. I don't see a clear reason to expect AGI to work differently once it is, in fact, generally capable at the level of smarter-than-almost-all-humans. And whether that gap is narrow or wide, and whether crossing it takes minutes or a decade, the critical concern is that we might not see misalignment of the most worrying kinds until after we are on the far end of the gap.
I think the OP's argument depends on the idea that 'Nobody is going to debug the geopolitical abilities of an AI designed to build paperclips. So the fact that debugging occurs in one domain is no guarantee of success in any other.' If AIs have human-level or above capacities in the domains relevant to forming an initial plan to take over the world and beginning that plan, but have subhuman capacities/bugs in the later stages of that plan, then, assuming that at least human-level capacities are needed in those later domains in order to succeed, the threshold could be pretty large: AIs could keep getting smarter at domains related to the initial stages of the plan, which are presumably closer to the distributions they have been trained on (e.g. social manipulation/text output to escape a box), while failing to make as much progress in the more out-of-distribution domains.
Part of my second point is that smart people figure out for themselves what they need to know in new domains, and by my definition of 'general intelligence' there is little reason to think an AGI will be different. The analogies to ANI with domain-specific knowledge that doesn't generalize well seem to ignore this, though I agree it's a reason to be slightly less worried that ANI systems could scale in ways that pose risks without developing generalized intelligence first.
I mostly agree with you that if we get AGI and not ANI, the AGI will be able to learn the skills relevant to taking over the world. However, I think that due to inductive biases and quasi-innate intuitions, different generally intelligent systems are differently able to learn different domains. For example, it is very difficult for autistic people (particularly severely autistic people) to learn social skills. Similarly, high-quality philosophical thinking seems to be basically impossible for most humans. Applying this to AGI, it might be very hard for an AGI to learn how to make long-term plans or to learn social skills.
Interesting perspective. Though, leaning on Cotra's recent post, if the first AGI is developed by iterations of reinforcement learning in different domains, it seems likely that it will develop a rather accurate view of the world, as that will give the highest rewards. This means the AGI will have high situational awareness; i.e., it will know that it's an AGI, and it will very likely know about human biases. I thus think it will also be aware that it contains mental bugs itself and may start actively trying to fix them (since that will be reinforced, as it gives higher rewards in the longer run). I thus think that we should expect it to contain a surprisingly low number of very general bugs, such as weird ways of thinking or false assumptions in its worldview. That's why I believe the first AGI will already be very capable, and smart enough to hide for a long time until it strikes and overthrows its owners.
Yeah, I guess another consequence of how bugs are distributed is that the methodology of AI development matters a lot. An AI that is trained and developed over huge numbers of different domains is far, far, far more likely to succeed at takeover than one trained for specific purposes such as solving math problems. So the HFDT from that post would definitely be of higher concern if it worked (although I'm skeptical that it would).
I do think that any method of training will still leave holes, however. For example, the scenario where HFDT is trained by watching how experts use a computer would leave out all the other, non-computer domains of expertise. So even if it were a perfect reasoner about all scientific, artistic, and political knowledge, you couldn't just shove it into a robot body and expect it to do a backflip on its first try, no matter how many backflipping manuals it had read. I think there will be sufficiently many out-of-domain problems to stymie world-domination attempts, at least initially.
I think a main difference of opinion I have with AI risk people is that I think subjugating all of humanity is an almost impossibly hard task, requiring a level of intelligence and perfection across a range of fields that is stupendously far above human level, and I don't think it's possible to reach that level without vast, vast amounts of empirical testing.
Agree that it depends a lot on the training procedure. However, I think that given high situational awareness, we should expect the AI to know its shortcomings very well.
So I agree that it won't be able to do a backflip on the first try. But it will know that it would likely fail, and thus it won't rely on plans that require backflips; or, if it does need backflips, it will find a way of learning them without being suspicious (e.g. by manipulating a human into training it to do backflips).
I think overthrowing humanity is certainly hard. But it still seems possible for a patient AGI that slowly accumulates wealth and power by exploiting human conflicts, getting involved in crucial economic processes, and potentially gaining control of military communication systems using deepfakes and the wealth and power it has accumulated. (And all this can be done by just interacting with a computer interface, as in Cotra's example.) It's also fairly likely that there are some exploits in the way humans work that we are not aware of, which the AGI would learn from being trained on tons of data, and which would make it even easier.
So overall, I agree the AGI will have bugs, but it will also know it likely has bugs and thus will be very careful with any attempts at overthrowing humanity.
So I think my most plausible scenario of AI success would be similar to yours: you build up wealth and power through some sucker corporation or small country that thinks it controls you, then use their R&D resources along with your intelligence to develop some form of world-destruction-level technology that can be deployed without resistance. I think this is orders of magnitude more likely to work than Yudkowsky's ridiculous 'make a nanofactory in a beaker from first principles' strategy.
I still think this plan is doomed to fail (for early AGI). It's multi-step, highly complicated, and requires interactions with a lot of humans, who are highly unpredictable. You really can't avoid 'backflip steps' in such a process. By that I mean there will be things it needs to do for which there is not enough data available to perfect them, so it just has to roll the dice. For example, there is no training set for 'running a secret globe-spanning conspiracy', so it will inevitably make mistakes there. If we discover it before it's ready to defeat us, it loses. Also, by the time it pulls the trigger on its plan, there will be other AGIs around, and other examples of failed attacks that put humanity on alert.
A key crux here seems to be your claim that AIs will attempt these plans before they have the relevant capacities because they are on short time scales. However, given enough time and patience, it seems clear to me that the AI could succeed simply by not taking risky actions that it knows it might mess up until it self-improves enough to be able to take them. The question then becomes how long the AI thinks it has until another AI that could dominate it is built, as well as how fast self-improvement is.
Narrow AIs have moved from buggy/mediocre to hyper-competent very quickly (months). If early AGIs are widely copied/escaped, the global resolve and coordination required to contain them would be unprecedented in breadth and speed.
I expect warning shots, and expect them to be helpful (vs no shots), but take very little comfort in that.
They've learned within months for certain problems where learning can be done at machine speeds, i.e. game-like problems where the AI can 'play against itself', or problems where huge amounts of data are available in machine-friendly format. But that isn't the case for every application. For example, developing self-driving cars up to perfection level has taken way, way longer than expected, partially because they have to deal with freak events that are outside the norm, so a lot more experience and data has to be built up, which takes human time. (Of course, humans are also not great at freak events, but remember we're aiming for perfection here.) I think most tasks involved in taking over the world will look a lot more like self-driving cars than playing Go, which inevitably means mistakes, and a lot of them.
I strongly agree with you on points one and two, though I'm not super confident on three. For me the biggest takeaway is that we should be putting more effort into attempts to instill 'false' beliefs which are safety-promoting and self-stable.
I could see this backfiring. What if instilling false beliefs just later led to the meta-belief that deception is useful for control?
That's a fair point; I'm reconsidering my original take.