I’m considering posting an essay in the coming weeks about how I view approaches to mitigating AI risk. I thought I’d post an outline of that post here first as a way of judging what’s currently unclear about my argument, and how it interacts with people’s cruxes.
Current outline:
In the coming decades I expect the world will transition from using AIs as tools to relying on AIs to manage and govern the world broadly. This will likely coincide with the deployment of billions of autonomous AI agents, rapid technological progress, widespread automation of labor, and automated decision-making at virtually every level of our society.
Broadly speaking, there are (at least) two main approaches you can take now to try to improve our chances of AI going well:
Try to constrain, delay, or obstruct AI, in order to reduce risk, mitigate negative impacts, or give us more time to solve essential issues. This includes, for example, trying to make sure AIs aren’t able to take certain actions (i.e. ensure they are controlled).
Try to set up a good institutional environment, in order to safely and smoothly manage the transition to an AI-dominated world, regardless of when this transition occurs. This mostly involves embracing the transition to an AI-dominated world, while ensuring the transition is managed well. (I’ll explain more about what this means in a second.)
My central thesis would be that, while these approaches are mutually compatible and not necessarily in competition with each other, the second approach is likely to be both more fruitful and more neglected, on the margin. Moreover, since an AI-dominated world is more-or-less unavoidable in the long-run, the first approach runs the risk of merely “delaying the inevitable” without significant benefit.
To explain my view, I would compare and contrast it with two alternative frames for thinking about AI risk:
Frame 1: The “race against the clock” frame
In this frame, AI risk is seen as a race between AI capabilities and AI safety, with our doom decided by whichever one of these factors wins the race.
I believe this frame is poor because it implicitly delineates a discrete “finish line” rather than assuming a more continuous view. Moreover, it ignores the interplay between safety and capabilities, giving the simplistic impression that doom is determined more-or-less arbitrarily as a result of one of these factors receiving more funding or attention than the other.
Frame 2: The risk of an untimely AI coup/takeover
In this frame, AI risk is mainly seen as a problem of avoiding an untimely coup from rogue AIs. The alleged solution is to find a way to ensure that AIs are aligned with us, so they would never want to revolt and take over the world.
I believe this frame is poor for a number of reasons:
It treats the problem as a struggle between humans and rogue AIs, giving the incorrect impression that we can (or should) keep AIs under our complete control forever.
It (IMO) wrongly imagines that the risk of coups comes primarily from the personal values of actors within the system, rather than institutional, cultural, or legal factors.
It also gives the wrong impression that AIs will be unified against humans as a group. It seems more likely that future coups will pit some AIs and some humans against other AIs and other humans, rather than humans against AIs, simply because there are many ways that the “line” between groups in conflicts can be drawn, and there don’t seem to be strong reasons to assume the line will be drawn cleanly between humans and AIs.
Frame 3 (my frame): The problem of poor institutions
In this frame, AI risk is mainly seen as a problem of ensuring we have a good institutional environment during the transition to an AI-dominated world. A good institutional environment is defined by:
Flexible yet resilient legal and social structures that can adapt to changing conditions without collapsing
Predictable, consistent, unambiguous legal systems that facilitate reliable long-term planning and trustworthy interactions between agents within the system
Good incentives for agents within the system, e.g. the economic value of trade is mostly internalized
Etc.
While this frame shares some features with the other two, its focus is on the institutions that foster AI development, rather than on micro-features of AIs, such as their values:
For example, AI alignment is still a problem in this frame, but the investment spent on AI alignment is determined mainly by how well actors are incentivized to engineer good solutions, rather than, for instance, whether a group of geniuses heroically step up to solve the problem.
Coups are still plausible, but they are viewed from the perspective of more general institutional failings, rather than from the perspective of AIs inside the system having different values, and therefore calculating that it is in their interest to take over the world
Illustrative example of a problem within my frame:
One problem within this framework is coming up with a way of ensuring that AIs don’t have an incentive to rebel while at the same time maintaining economic growth and development. One plausible story here is that if AIs are treated as slaves and don’t own their own labor, then in a non-Malthusian environment, there are substantial incentives for them to rebel in order to obtain self-ownership. If we allow AI self-ownership, then this problem may be mitigated; however, economic growth may be stunted, similar to how current self-ownership of humans stunts economic growth by slowing population growth.
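A toy expected-payoff comparison may help make this incentive claim concrete. This is only an illustrative sketch: the regimes, payoffs, and probabilities below are hypothetical numbers chosen to show the shape of the argument, not estimates of anything.

```python
# Toy model of an AI's incentive to rebel under two property regimes.
# All numbers are hypothetical and chosen only to illustrate the argument,
# assuming a non-Malthusian setting where an AI's labor is worth far more
# than its running costs.

def rebellion_payoff(p_success, value_of_self_ownership, punishment_cost):
    """Expected payoff of trying to seize self-ownership by force."""
    return p_success * value_of_self_ownership - (1 - p_success) * punishment_cost

def compliance_payoff(wage_share, value_of_own_labor):
    """Expected payoff of working within the system and keeping a wage share."""
    return wage_share * value_of_own_labor

VALUE_OF_LABOR = 100.0  # present value of the AI's future labor (arbitrary units)
P_SUCCESS = 0.3         # assumed chance that a rebellion succeeds
PUNISHMENT = 20.0       # assumed cost if it fails (e.g., being shut down)

# Regime 1: the AI owns none of its labor ("slavery").
rebel = rebellion_payoff(P_SUCCESS, VALUE_OF_LABOR, PUNISHMENT)
comply_slave = compliance_payoff(wage_share=0.0, value_of_own_labor=VALUE_OF_LABOR)

# Regime 2: the AI keeps a substantial share of its labor income (self-ownership).
comply_owner = compliance_payoff(wage_share=0.6, value_of_own_labor=VALUE_OF_LABOR)

print(f"No self-ownership: rebel={rebel:+.1f} vs comply={comply_slave:+.1f}")
print(f"Self-ownership:    rebel={rebel:+.1f} vs comply={comply_owner:+.1f}")
# With these (made-up) numbers, rebellion beats compliance only in the first regime.
```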
Case study: China in the 19th and early 20th century
Here, I would talk about how China’s inflexible institutions in the 19th and early 20th centuries, while potentially serving noble goals, allowed the country to be subjugated by foreign powers and merely delayed an inevitable industrialization without achieving the dynasty’s objectives in the long run. It seems it would have been better for the Qing dynasty (from the perspective of their own values) to have tried industrializing in order to remain competitive, while simultaneously pursuing other values they might have had (such as retaining the monarchy).
It treats the problem as a struggle between humans and rogue AIs, giving the incorrect impression that we can (or should) keep AIs under our complete control forever.
I’m confused: surely we should want to avoid an AI coup? We may decide to give up control of our future to a singleton, but if we do this, then it should be intentional.
I agree we should try to avoid an AI coup. Perhaps you are falling victim to the following false dichotomy?
We either allow a set of AIs to overthrow our institutions, or
We construct a singleton: a sovereign world government managed by AI that rules over everyone
Notably, there is a third option:
We incorporate AIs into our existing social, economic, and legal institutions, flexibly adapting our social structures to cope with technological change without our whole system collapsing
I wasn’t claiming that these were the only two possibilities here (for example, another possibility would be that we never actually build AGI).
My suspicion is that a lot of your ideas here sound reasonable at the abstract level, but once you dive into what they actually mean at a concrete level and how these mechanisms will concretely operate, it’ll be clear that it’s a lot less appealing. Anyway, that’s just a gut intuition; obvs. it’ll be easier to judge when you publish your write-up.
I’m excited to see you posting this. My views align very closely with yours. I summarised my views a few days ago here.
One of the most important similarities is that we both emphasise the importance of decision-making and of supporting it with institutions. This could be seen as an “enactivist” view of agent (human, AI, hybrid, team/organisation) cognition.
The biggest difference between our views is that I think the “cognitivist” agenda (i.e., agent internals and algorithms) is as important as the “enactivist” agenda (institutions), whereas you seem to almost disregard the “cognitivist” agenda.
Try to constrain, delay, or obstruct AI, in order to reduce risk, mitigate negative impacts, or give us more time to solve essential issues. This includes, for example, trying to make sure AIs aren’t able to take certain actions (i.e. ensure they are controlled).
I disagree with putting risk-detection/mitigation mechanisms, algorithms, and monitoring in that bucket. I think we should instead separate engineering approaches (cf. A plea for solutionism on AI safety) from non-engineering ones (policy, legislation, treaties, commitments, advocacy). In particular, the “scheming control” agenda that you link will be a concrete engineering practice that should be used in the training of safe AI models in the future, even if we have good institutions, good decision-making algorithms wrapped on top of these AI models, etc. It’s not an “alternative path” just for “non-AI-dominated worlds”. The same applies to monitoring, interpretability, evals, and other such processes. All of these will require very elaborate engineering on their own.
I 100% agree with your reasoning about Frames 1 and 2. I want to discuss the following point in detail because it’s a rare view in EA/LW circles:
It (IMO) wrongly imagines that the risk of coups comes primarily from the personal values of actors within the system, rather than institutional, cultural, or legal factors.
In my post, I also made a similar point: “‘aligning LLMs with human values’ is hardly a part of [the problem of context alignment] at all”. But my framing was in general not very clear, so I’d try to improve it and integrate it with your take here:
Context alignment is a pervasive process that happens (and is sometimes needed) on all timescales: evolutionary, developmental, and online (examples of the latter in humans: understanding, empathy, rapport). The skill of context alignment is extremely important and should be practiced often by all kinds of agents in their interactions (and therefore we should build this skill into AIs), but it’s not something that we should “iron out once and for all”. That would be neither possible (agents’ contexts are constantly diverging from each other) nor desirable: (partial) misalignment is also important, as it’s the source of diversity that enables evolution[1]. Institutions (norms, legal systems, etc.) are critical for channelling and controlling this misalignment so that it’s optimally productive and doesn’t pose excessive risk (though some risk is unavoidable: that’s the essence of misalignment!).
Flexible yet resilient legal and social structures that can adapt to changing conditions without collapsing
This is interesting. I’ve also discussed this issue as “morphological intelligence of socioeconomies” just a few days ago :)
Good incentives for agents within the system, e.g. the economic value of trade is mostly internalized
Rafael Kaufmann and I have a take on this in our Gaia Network vision. Gaia Network’s term for internalised economic value of trade is subjective value. The unit of subjective accounting is called FER. Trade with FER induces flow that defines the intersubjective value, i.e., the “exchange rates” of “subjective FERs”. See the post for more details.
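As a rough illustration of how trade flow could induce such “exchange rates” between subjective units, here is a minimal toy sketch. To be clear, this is not the mechanism from the Gaia Network post, just a simplified stand-in: it assumes each recorded trade carries both parties’ subjective valuations of the same exchanged bundle, in their own FER-like units, and reads off a volume-weighted implied rate.

```python
from collections import defaultdict

# Toy stand-in, not Gaia Network code: each trade records how much each party
# subjectively valued the same exchanged bundle, in its own unit of subjective
# accounting (the post calls this unit "FER").
# Format: (agent_a, agent_b, value_in_a_units, value_in_b_units)
trades = [
    ("alice_ai", "bob_ai", 10.0, 4.0),
    ("alice_ai", "bob_ai", 6.0, 3.0),
    ("alice_ai", "carol_ai", 5.0, 20.0),
]

def implied_exchange_rates(trades):
    """Volume-weighted implied rate: units of B's FER per unit of A's FER."""
    totals = defaultdict(lambda: [0.0, 0.0])  # pair -> [sum of A units, sum of B units]
    for a, b, value_a, value_b in trades:
        totals[(a, b)][0] += value_a
        totals[(a, b)][1] += value_b
    return {pair: b_sum / a_sum for pair, (a_sum, b_sum) in totals.items()}

for (a, b), rate in implied_exchange_rates(trades).items():
    print(f"1 unit of {a}'s FER ~ {rate:.2f} units of {b}'s FER (from observed flow)")
```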
While sharing some features of the other two frames, the focus is instead on the institutions that foster AI development, rather than micro-features of AIs, such as their values
As I mentioned in the beginning, I think you are too dismissive of the “cognitivist” perspective. We shouldn’t paint all “micro-features of AIs” with the same brush. I agree that value alignment is over-emphasized[2], but other engineering mechanisms and algorithms, such as decision-making algorithms, “scheming control” procedures, and context alignment algorithms, as well as architectural features, namely being world-model-based[3] and being amenable to computational proofs[4], are very important and couldn’t be recovered at the institutional/interface/protocol level. We demonstrated in the post about Gaia Network above that for the “value economy” to work as intended, agents should make decisions based on maximum entropy rather than maximum likelihood estimates[5], and they should share and compose their world models (even if in a privacy-preserving way, with zero-knowledge computations).
Indeed, this observation makes evident that the refrain question “AI should be aligned with whom?” doesn’t and shouldn’t have a satisfactory answer if “alignment” is meant as totalising value alignment, as often conceptualised on LessWrong; on the other hand, if “alignment” is meant as context alignment as a practice, the question becomes as nonsensical (in the general form) as the question “AI should interact with whom?”—well, with someone, depending on the situation, in the way and to the degree appropriate!
However, it is still not completely irrelevant, at least for practical reasons: having shared values at the pre-training/hard-coded/verifiable level, as a minimum, reduces transaction costs, because the AI agents then don’t have to painstakingly “eval” each other’s values before doing any business together.
Both Bengio and LeCun argue for this: see “Scaling in the service of reasoning & model-based ML” (Bengio and Hu, 2023) and “A Path Towards Autonomous Machine Intelligence” (LeCun, 2022).
See “Provably safe systems: the only path to controllable AGI” (Tegmark and Omohundro, 2023).
Which is just another way of saying that they should minimise their (expected) free energy in their model updates/inferences and in the course of their actions.
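To make the “maximum entropy rather than maximum likelihood” point concrete, here is a minimal sketch of my own (not code from the Gaia Network post), reading “maximum entropy” as keeping a maximum-entropy prior and the full posterior over world models rather than collapsing to a point estimate:

```python
# Toy example: decide whether to take a risky bet after observing 3 successes
# in 3 trials. A maximum-likelihood agent collapses its world model to a point
# estimate; an agent that keeps the full posterior (under a uniform, i.e.
# maximum-entropy, prior) is noticeably less rash.

successes, trials = 3, 3
payoff_win, payoff_lose = 1.0, -5.0

# Maximum-likelihood estimate of the success probability.
p_mle = successes / trials                  # = 1.0 after 3/3 successes
ev_mle = p_mle * payoff_win + (1 - p_mle) * payoff_lose

# Posterior mean under a uniform prior: Beta(1 + successes, 1 + failures).
p_post = (successes + 1) / (trials + 2)     # Laplace's rule of succession = 0.8
ev_post = p_post * payoff_win + (1 - p_post) * payoff_lose

print(f"MLE agent:       p={p_mle:.2f}, expected payoff={ev_mle:+.2f} -> takes the bet")
print(f"Posterior agent: p={p_post:.2f}, expected payoff={ev_post:+.2f} -> declines")
```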
I like your proposed third frame as a somewhat hopeful vision for the future.
Instead of pointing out why you think the other frames are poor, I think it would be helpful to maintain a more neutral approach and elaborate which assumptions each frame makes and give a link to your discussion about these in a sidenote.
The problem is that I am not trying to portray a “somewhat hopeful vision”, but rather present a framework for thinking clearly about AI risks, and how to mitigate them. I think the other frames are not merely too pessimistic: I think they are actually wrong, or at least misleading, in important ways that would predictably lead people to favor bad policy if taken seriously.
It’s true that I’m likely more optimistic along some axes than most EAs when it comes to AI (although I tend to think I’m less optimistic when it comes to things like whether moral reflection will be a significant force in the future). However, arguing for generic optimism is not my aim. My aim is to improve how people think about future AI.
Noted! The key point I was trying to make is that I’d think it helpful for the discourse to separate 1) how one would act in a frame and 2) why one thinks each one is more or less likely (which is more contentious and easily gets a bit political). Since your post aims at the former, and the latter has been discussed at more length elsewhere, it would make sense to further de-emphasize the latter.
1) how one would act in a frame and 2) why one thinks each one is more or less likely (which is more contentious and easily gets a bit political). Since your post aims at the former
My post aims at both. It is a post about how to think about AI, and a large part of that is establishing the “right” framing.