Investigating the role of agency in AI x-risk

In a previous post, I introduced a framework for applying scenario planning to AI x-risk. My colleagues at Convergence followed up with posts defining and analyzing predictions about timelines to TAI. In this post, I examine a different parameter: agency.

1. Executive Summary

In this post, I examine the nature and role of agency in AI existential risk. As a framework, I use Joseph Carlsmith’s power-seeking threat model. It illustrates how agentic AI systems might seek power in unintended ways, leading to existential catastrophe.

In order to clarify the nature of agency, I explore some literature in the philosophy of action. I discuss the belief-desire-intention model of agency, as well as the spectrums of rationality and representational complexity along which agents can differ. I also consider the merits of instrumentalist and realist perspectives on agency, as well as the relevance of group agency.

Then, I evaluate arguments for why agentic AI might exhibit power-seeking behavior, primarily focusing on the theory of instrumental convergence. I follow Dmitri Gallow’s analysis, which distinguishes between the Convergent Instrumental Value Thesis (CIVT) and the Instrumental Convergence Thesis (ICT). CIVT holds that power is an instrumentally convergent subgoal across a relevant set of possible goals, while ICT adds that superintelligent agents will pursue instrumentally convergent subgoals. I argue that the strength of these theses depends on the likely goals of superintelligent agents and their position on the spectrum of rationality.

Finally, I discuss whether we should expect TAI systems to be the kinds of agents that seek power. I review Eric Drexler’s Comprehensive AI Services (CAIS) model, which suggests that superintelligence may be realized through narrow, task-specific systems rather than general, agentic systems. I also review some potential pressures towards the development of agentic, power-seeking systems.

I conclude by outlining four scenarios based on whether agency and power-seeking are default outcomes of developing transformative AI. I suggest high-level strategies for each scenario, such as avoiding building TAI, developing defenses against agentic systems, or focusing on other AI threat models.

2. The agentic threat model

Transformative AI engenders several distinct catastrophic threats. I wrote in a previous post that the existential risk literature recognizes at least four types of risk: intentional (e.g. malicious use), structural (e.g. arms-race dynamics), accidental (e.g. complex system interactions), and agentic. This post will look more closely at the last — that is, risk from an AI system’s goals not being aligned with those of its operators.

The nature of agency in transformative AI systems is central to some of their most extreme risks. Of the above categories, agentic threat models present the clearest cases of truly existential catastrophes. They might also be the most difficult to avoid — in the extreme, agentic threat models might indicate that existential catastrophe is the “default outcome” of developing TAI.

I’ll use Joseph Carlsmith’s report, “Is Power-Seeking AI an Existential Risk?” as a representative description of an agentic threat model. He summarizes his model as follows:

  1. It will become possible and financially feasible to build AI systems with the following properties:

    • Advanced capability: they outperform the best humans on some set of tasks which when performed at advanced levels grant significant power in today’s world (tasks like scientific research, business/​military/​political strategy, engineering, and persuasion/​manipulation).

    • Agentic planning: they make and execute plans, in pursuit of objectives, on the basis of models of the world.

    • Strategic awareness: the models they use in making plans represent with reasonable accuracy the causal upshot of gaining and maintaining power over humans and the real-world environment.
      (Call these “APS”—Advanced, Planning, Strategically aware—systems.)

  2. There will be strong incentives to build and deploy APS systems | (1).[1]

  3. It will be much harder to build APS systems that would not seek to gain and maintain power in unintended ways (because of problems with their objectives) on any of the inputs they’d encounter if deployed, than to build APS systems that would do this, but which are at least superficially attractive to deploy anyway | (1)–(2).

  4. Some deployed APS systems will be exposed to inputs where they seek power in unintended and high-impact ways (say, collectively causing >$1 trillion dollars of damage), because of problems with their objectives | (1)–(3).

  5. Some of this power-seeking will scale (in aggregate) to the point of permanently disempowering ~all of humanity | (1)–(4).

  6. This disempowerment will constitute an existential catastrophe | (1)–(5). (Carlsmith 2021, pg. 3)

This model might be grouped into two parts: premises (1)-(2), that TAI will be agentic and strategically aware, and premises (3)-(6), that those systems will behave in ways that ultimately disempower humanity. I’ll simplify the model accordingly to get at the role of agency.

First, in premise (1), Carlsmith leaves open the possibility that an advanced AI system might perform some set of tasks at a superhuman level, but not others. If that were the case, then the probabilities of premises (4) and (5) would largely be a function of what those capabilities are. Instead, I’ll assume the relevant system performs all important tasks at a superhuman level.

This is not an innocuous assumption, but it will allow me to consider the role of agency independently of capabilities. For example, it allows me to assume that premises (4)-(6) hold conditional on a superintelligent system seeking power.

Second, I assume that a system that outperforms the best humans in domains like military, business, and political strategy would be able to accurately model power in a real-world environment, meaning that superintelligence implies strategic awareness.

Finally, I leave evaluating the possibility and timelines to TAI for other work.

Here is what remains:

  1. TAI will be agentic.

  2. Agentic TAI systems will be power-seeking.

Obviously, this is a very simple model. As I will argue later, one way in which it is too simplistic is that both agency and power-seeking exist on spectrums. However, it allows me to identify three key questions about the role of agency in the agentic threat model. Namely:

  1. What does it mean for a system to be an agent?

  2. What about agency might generate power-seeking behavior?

  3. Should we expect TAI systems to be the kinds of agents that seek power?

The goal of this post is to clarify some concepts and arguments necessary to answering these questions by reviewing some relevant literature. While I will sometimes give my own analysis, this should be read as speculative.

3. What is agency?

The natures of some strategic parameters are fairly clear. For example, “timelines to TAI” simply describes a length of time. However, the nature of agency is somewhat less clear. According to Carlsmith,

“a system engages in “agentic planning” if it makes and executes plans, in pursuit of objectives, on the basis of models of the world (to me, this isn’t all that different from bare “agency,” but I want to emphasize the planning aspect).” (8)

However, he also warns that “muddyness about abstractions in this vicinity is one of my top candidates for ways arguments of the type I consider might mislead[...].” (9)

I agree. Given the role it plays in the agentic threat model, it’s worthwhile to spend some time clarifying what we mean by ‘agency.’

The standard theory

If philosophy is useful for anything, it’s useful for clarifying concepts. The purpose of this section is to review the philosophical literature on agency with the goal of extracting lessons for thinking about the agentic threat model.

Most relevant philosophical literature is a part of a discipline known as the philosophy of action, which studies human intentional action. Though the object of its analysis is ‘action’ rather than ‘agency’, the concepts are two sides of the same coin: an agent can be thought of as whatever has the capacity to take actions.

The ‘standard’ theory of action (also known as the ‘causal’ theory of action) says that something is an action if it is caused in the right way by the right mental states.

The belief-desire model. A simple form of the standard theory says that something is an intentional action if it is caused by the right belief and desire. If I want a soda, and I believe that putting money into a vending machine will deliver me a soda, and this causes me to put money into the vending machine, then I have acted.

The relation between a belief-desire pair and an action is also one of explanation: an action is explained by reference to the agent’s beliefs and desires. In particular, an agent’s beliefs and desires give them reason to act. For example, my desire for a soda and my belief about the soda machine’s operation give me reason to feed it a fiver.

Giving reason is what it means for a certain desire-belief pair to be ‘right.’ My desire for a soda and belief that putting money into a vending machine will deliver me one does not give me reason to set the machine on fire. If I do, then I will have to explain that action with respect to a different desire-belief pair.

The belief-desire model was most influentially formulated by Donald Davidson in his 1963 paper, Actions, Reasons, and Causes. Davidson writes that:

Giving the reason why an agent did something is often a matter of naming the pro attitude[2] (a) or the related belief (b) or both; let me call this pair the primary reason why the agent performed the action. Now it is possible to reformulate the claim that rationalizations are causal explanations, and give structure to the argument as well, by stating two theses about primary reasons:

  1. For us to understand how a reason of any kind rationalizes an action it is necessary and sufficient that we see, at least in essential outline, how to construct a primary reason.

  2. The primary reason for an action is its cause. (Davidson 1963)

If someone asks me why I put money into a soda machine, I can tell them that I wanted a soda and I believed that giving the machine money would get me one. In Davidson’s lingo, this explanation rationalizes my action (that is, allows us to interpret an event as an action) because that desire/belief pair in fact caused that action.

The belief-desire-intention model. While the belief-desire model is still popular in fields outside of philosophy, it is widely rejected by contemporary philosophers. A common objection to the belief-desire account is that intention is not reducible to beliefs and desires.

For example, in his 1987 book, Intention, Plans, and Practical Reason, Michael Bratman argues that intentions are disanalogous with desires in several critical ways. For one, desires suggest but do not control conduct, whereas intentions do aim to control conduct. I might have a desire for soda, but I could rationally resist that desire — for example, I might be trying to avoid sugar. If I intend to get a soda, though, and I fail to do so (without having reconsidered that intention), then something has gone wrong, and my status as an agent is called into question.

Bratman also observes that intentions have “inertia” — an intention is the default option once made, and remains so until the agent has a reason for reconsidering it. Because of this inertia, intentions generate further intentions as a result of means-end reasoning (my intention to get a soda today gives me reason to form further intentions to find a soda machine, bring my wallet with me, etc.) In contrast, desires can change without needing a reason, and while they might generate reasons for certain intentions, they do not play a direct role in means-end reasoning.

Therefore, Bratman argues that we should treat intentions as mental states on a par with beliefs and desires — that is, intention is not reducible to belief and desire. The corresponding belief-desire-intention model says that something is an action if it is caused by the right intention, which is in turn caused by the right beliefs, desires, and other intentions.
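To make the structure concrete, here is a minimal, purely illustrative sketch (my own, not drawn from Bratman or any agent-programming framework) of a belief-desire-intention agent in Python; the field names and the `deliberate`/`act` methods are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class BDIAgent:
    """Toy belief-desire-intention structure: intentions are a distinct state,
    not reducible to beliefs plus desires."""
    beliefs: dict = field(default_factory=dict)     # how the agent takes the world to be
    desires: list = field(default_factory=list)     # ends it would like to bring about
    intentions: list = field(default_factory=list)  # plans it has committed to

    def deliberate(self):
        """Form intentions from beliefs and desires. Once formed, an intention
        persists until reconsidered (Bratman's 'inertia')."""
        for desire in self.desires:
            achievable = self.beliefs.get(("can_achieve", desire), False)
            if achievable and desire not in self.intentions:
                self.intentions.append(desire)

    def act(self):
        """Execute a standing intention rather than re-weighing desires each step."""
        return self.intentions[0] if self.intentions else None
```

A system with only the `beliefs` and `desires` fields, and no mechanism like `deliberate` for committing to intentions, would correspond to the belief-desire model, a point the ‘Lessons’ paragraph below picks up.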

Deviant causal chains. However, both forms of the standard theory are susceptible to the problem of deviant causal chains. The problem is that something can be caused by the right mental states in the wrong way. For example, suppose that someone wanting to distract some partygoers believes that dropping a glass of wine will do the trick, and intends to do so. These mental states collectively cause the saboteur to become nervous — and accidentally drop the glass.[3]

Therefore, the standard theory ascribes agency too widely. Many attempts have been made to amend the standard theory to specify the right way for mental states to cause actions. However, none have so far eliminated deviant causal chains, and there are principled reasons to suspect that none ever will.[4]

Here’s one way to describe that failure. The philosophy of action attempts to analyze the concept of ‘action.’ We can understand that concept as pointing to a certain set of possible events. The necessary and sufficient condition for an event to be an action, then, is membership in that set. An analysis of action picks out features that distinguish events inside the set from events outside the set. However, no combination of those features — mental states — lines up perfectly with the edges of that set. Philosophical analysis tries to “carve the universe at its joints,” but the universe defies simplification.

What’s more, the standard methodology in analytic philosophy assumes that we share a single unified concept, ‘action’. But there are edge cases, such as weakness of will, which reveal that our intuitions about action are not always shared or clear-cut. For example: suppose that I simply can’t resist drinking a soda when given the opportunity, even though I try my best. In that case, is my drinking a soda still an intentional action? Reasonable people can disagree.

Lessons. The standard theory fails to give sufficient conditions for an event to be an action, and therefore ascribes agency too widely. But it may also correctly describe some necessary conditions for agency.

For example, if we take the belief-desire-intention model to be an improvement on the belief-desire model, then for an AI system to be an agent, it must have something like intentions, and not just beliefs and desires. Assuming AI systems can have something like mental states, we can imagine a system that has beliefs (the ‘world model’ in Carlsmith’s definition of agency) and desires (evaluations of different world states), but lacks the right mechanisms for forming intentions.

Instrumentalism

Another similarity between the standard theory and Carlsmith’s definition of agency is that both assume that something is an agent in virtue of the fact that it has certain internal structures — in particular, the mental states (or AI equivalent) of desires, beliefs, and intentions. That is, both assume that it is in principle possible to discover that a system is an agent by demonstrating that certain internal structures (in particular, internal representations) play certain roles in its behavior. We can call this assumption realism about agency.

However, Carlsmith takes a different approach in his more recent report, Scheming AIs. Rather than defining an AI system to be an agent if it has certain properties, he argues that what matters is that the system is ‘well-understood’ or ‘well-predicted’ as agentic. For example, he writes:

[...] this discourse assumes that the behavior of certain kinds of advanced AIs will be well-predicted by treating them as though they are pursuing goals, and doing instrumental reasoning in pursuit of those goals, in a manner roughly analogous to the sorts of agents one encounters in economics, game-theory, and human social life [...] (Carlsmith 2023, pg. 57)

The position that agency is best understood as a predictive strategy (i.e. a hypothesis predicting behavior) is called instrumentalism. According to instrumentalism, it is not in principle possible to determine that a system is an agent by looking at its insides — a system is an agent if and only if its behavior is well-predicted as agentic.

The most influential presentation of instrumentalism can be found in Daniel Dennett’s 1987 book, The Intentional Stance. He writes that:

[… ] the intentional strategy consists of treating the object whose behavior you want to predict as a rational agent [...]. an intentional system [is] a system whose behavior is reliably and voluminously predictable[5] via the intentional strategy. (Dennett 1987, pg. 15)

Dennett uses the phrase “the intentional strategy” to describe a hypothesis about a system’s behavior. If a system’s behavior is best predicted by the hypothesis that it will behave with respect to certain beliefs, desires, and intentions, then the intentional strategy works. If the intentional strategy works, then the system in question is an agent.
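As a schematic illustration (my own, not Dennett’s formalism), the instrumentalist criterion can be put as a model-comparison procedure; the function names and inputs below are hypothetical.

```python
def accuracy(predict, observations):
    """Fraction of observed (situation, behavior) pairs a predictive strategy gets right."""
    return sum(predict(situation) == behavior
               for situation, behavior in observations) / len(observations)

def is_intentional_system(observations, intentional_predict, rival_predicts):
    """A system counts as an intentional system iff the intentional strategy
    (predicting behavior from ascribed beliefs, desires, and intentions)
    out-predicts rival strategies, such as a purely physical model."""
    best_rival = max(accuracy(p, observations) for p in rival_predicts)
    return accuracy(intentional_predict, observations) > best_rival
```

On this framing, agency is something we discover from the outside, by comparing predictive strategies, rather than by inspecting internal structure.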

Lessons. I don’t think Carlsmith means to take sides between realism and instrumentalism. The goal of his reports is to predict the behavior of agentic AI systems — so it doesn’t really matter whether agency is defined in terms of that behavior or by internal structures. However, the distinction does have implications for AI safety more generally.

For example, if instrumentalism is right, then it is not possible to predict, in advance of observing its behavior, whether a system is an agent. Dennett writes:

It is not that we attribute (or should attribute) beliefs and desires only to things in which we find internal representations, but rather that when we discover some object for which the intentional strategy works, we endeavor to interpret some of its internal states or processes as internal representations. What makes some internal feature of a thing a representation could only be its role in regulating the behavior of an intentional system. (Dennett 1987, pg. 32)

The success of AI research in designing agentic systems (for example, in reinforcement learning) should make us skeptical of a strong version of Dennett’s argument. For example, even before deployment, we could interpret the reward function of an RL agent as playing a similar role to desire. If it were never possible to predict in advance whether the intentional strategy works for a system, then it would be impossible to design a system to be agentic.
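For instance, here is a minimal, hypothetical RL-style sketch (mine, not drawn from any particular framework) in which the designer-specified reward function plays the functional role of desire, identifiable before the system has produced any behavior.

```python
def reward(state):
    """Designer-specified objective: plays a role analogous to desire."""
    return 1.0 if state == "goal" else 0.0

# Learned action-value estimates: internal states that will shape behavior,
# playing a role loosely analogous to beliefs about which actions serve the objective.
q_values = {}

def act(state, actions):
    """Greedy policy: take the action currently estimated to best satisfy the objective."""
    return max(actions, key=lambda a: q_values.get((state, a), 0.0))
```

The point is only that these internal roles are fixed by design, which is what a strong instrumentalism would seem to rule out.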

So, to some extent, we should expect realism about agency to be correct. The implications for AI safety are that we are not limited only to predicting and intervening in the behavior of AI agents. Instead, we have a hand in preventing those systems from becoming agents, or designing agents in safer ways. This class of strategies wouldn’t be possible if a strong version of instrumentalism were true.

However, a weaker version of Dennett’s argument might be that it is in principle impossible to determine that a system won’t be an agent prior to observing its behavior. This is more plausible. After all, the intentional strategy works for a wide variety of systems, from individual organisms, to human groups, to AI. The machinery that embeds internal representations across these systems is equally varied, and, except in the case of RL agents, we have first understood the system via the intentional strategy, and only later interpreted its internal structures as beliefs, desires, and intentions. The upshot is that we might expect to see agentic behavior in advanced AI before understanding the internal structures that give rise to it.

Group agency

The standard theory attempts to analyze human action. If we take AI agency to be possible, then the standard theory may unduly generalize from specifically human agency. Instrumentalism evaluates non-human agency straightforwardly: if any system is predictable via the intentional strategy, then it’s an agent. But can realism accommodate non-human agency?

One potential source of non-human agency is group agency. Group agency theorists argue that some human groups to which we habitually ascribe agency — such as corporations and nations — are in fact agents.

One group agent theorist, Christian List, reframes the standard theory to accommodate non-human agency. He writes that:

An intentional agent is an entity, within some environment, that meets at least three conditions:

  • It has representational states, which encode its “beliefs” about how things are.

  • It has motivational states, which encode its “desires” or “goals” as to how it would like things to be.

  • It has a capacity to interact with its environment on the basis of these states, so as to “act” in pursuit of its desires or goals in line with its beliefs. (Christian List 2021, pg. 1219)

The prima facie argument for group agency is that successful theories in the social sciences often ascribe states like beliefs, desires, and intentions to groups. In other words, they predict the behavior of groups by treating them as rational agents. Therefore, unless we have reason to believe otherwise, we should take these groups to actually be agents. List extends this argument: we should also assume that the hypothesized states are real — that is, group agents actually have internal representational and motivational states. Therefore, such groups are agents in a realist (and not just an instrumentalist) sense.

Note that List’s three conditions mirror the roles of belief, desire, and intention in the standard theory of agency (although his third condition isn’t as rich as Bratman’s use of ‘intention.’) Accordingly, we should expect it to admit deviant causal chains. However, in their 2011 book, Group Agency, List and Pettit explain that they judge theories based on their practical usefulness, rather than theoretical completeness. They write that:

There are two, sometimes competing preferences in the methodology of social science, and of science more generally. One is the mechanism-centered preference for explanations that identify the most basic factors at work in any given area of investigation. The other is the control-centered preference for explanations that direct us to the contextually most useful ways of predicting and intervening in what happens in that area. It should be clear from the foregoing that we are committed to the control-centered preference, believing that it is scientifically useful to identify the variables and laws that best facilitate intervention in any given area, even if they are not the most fundamental ones at work. (List and Pettit 2011, pg. 13)

According to List and Pettit, we should accept whichever theory best allows us to predict and control a system. For example, we could explain human behavior in terms of fundamental laws of physics — but, while such a theory might hypothetically predict human behavior, it would not be practically useful. In contrast, belief-desire-intention theory has broad (if imperfect) predictive power while remaining relatively simple.

Lessons. Since the purpose of investigating agency for AI safety is to enable better prediction and intervention, we should imitate List and Pettit’s control-centered methodological preference. For example, since we likely don’t need to solve the problem of deviant causal chains to effectively intervene in the agentic threat model, we can ignore it. This approach allows us to model agency in terms of the standard theory. After all, if agency is a source of risk, then it is better to err on the side of ascribing agency too widely than to fail to identify an agentic system. That being said, the more precise our theory of agency, the better it will enable predictions and reveal points of intervention.

We can also draw a lesson from the content of List and Pettit’s investigation: the existence of group agents shows that agency does not need to be unitary.

For his part, Carlsmith does not assume “that agentic planners cannot be constituted by many interacting, non-agentic-planning systems.” (Carlsmith 2021, pg. 10) That is, he assumes the possibility of a non-unitary AI agent. However, in his 2019 report, Reframing Superintelligence, Eric Drexler disagrees, writing:

In informal discussions of AI safety, it [has] been widely assumed that, when considering a system comprising rational, utility-maximizing AI agents, one can (or should, or even must) model them as a single, emergent agent. This assumption is mistaken, and worse, impedes discussion of a range of potentially crucial AI safety strategies. (Drexler 2019, pg. 51)

Note that Drexler has really identified three possible positions — that groups of AI systems can, should, or must be modeled as utility-maximizing agents, and it isn’t clear which he thinks is mistaken. He is right to reject the strongest of these (‘must’): certainly, we can imagine a set of agents that does not form a group agent. But he would be wrong to reject the weakest (‘can’), since we can also imagine a set of agents that does.

Drexler seems to conflate group cooperation with group agency. In particular, he argues that the difficulty of aggregating preferences across agents undermines the argument for group agency. He writes that:

There is no canonical way to aggregate utilities over agents[...]. Agents can compete to perform a task, or can perform adversarial tasks such as proposing and criticizing actions; from an external client’s perspective, these uncooperative interactions are features, not bugs [...]. (Drexler 2019, pg. 53)

List and Pettit agree that the problem of aggregating preferences is impractically difficult. However, they interpret this result as an argument in favor of group agency. It is precisely because aggregating preferences is impractical that we should expect group agency. They write that:

A group agent is autonomous in the relevant sense to the extent that the features that make it an agent – particularly its attitudes – are not readily reducible to features of the individual members (List and Pettit 2011, pg. 76-77)

We can look to human groups as examples of agents that contain uncooperative parts. For example, many political systems are adversarial, yet we can still model states as agents. Similarly, different teams within a corporation might compete for funding. Cooperation is not necessary to group agency.

The question is not whether sets of AI systems can be group agents, or whether they must be group agents, but rather whether we should expect a particular set to be a group agent. That depends on the specific organization of that set, as well as the pressures on its development. This is an analogous problem to whether we should expect a particular AI system to be an agent.

Rationality

Not all agents present power-seeking risks. I assume my dog does not intend to take over the world. Neither, I assume, do most humans. Therefore, we need to be able to distinguish between agents that might present power-seeking risks, and those that don’t. This section introduces two spectrums on which agents can differ: rationality and representational complexity.

As Michael Bratman observes, a theory of agency must not only be descriptive but also normative. We can describe agents as having beliefs and desires, or as acting on intentions — but we can also evaluate them by how well they act with respect to their beliefs, desires, and intentions. At one extreme, we may not ascribe agency at all to a system that completely fails to satisfy these normative requirements. At the other extreme, an agent that completely succeeds in satisfying them might be considered an ‘ideal’ agent.

We can call the normative aspect of agency rationality. It is a spectrum along which agents differ. Agency, then, is not a binary variable, but a continuous one.

List and Pettit taxonomize rational standards for agents into three categories: attitude-to-action, attitude-to-attitude, and attitude-to-fact. These standards “must be satisfied at some minimal level if a system is to count as an agent at all” (List and Pettit 2011, pg. 24). They also provide a nice framework for incorporating and extending Dennett and Bratman’s standards.

Attitude-to-action standards regulate what actions should be taken given certain attitudes. In particular, beliefs and desires can give us reason to take certain actions. If the balance of belief-desire reasons is in favor of a certain action, then not to take that action can be criticized as irrational.

Following Bratman, we might call belief-desire reasons an external evaluation of an agent’s practical rationality. In contrast, an internal evaluation takes into consideration the agent’s intentions. For example, suppose that agent X has all-things-considered greater belief-desire reason to take action Y instead of the mutually-exclusive action Z. The agent intends to take action Z, and doesn’t reconsider that intention. If the agent nonetheless takes action Y, then despite being externally rational, they are internally irrational — they didn’t follow through with an intention.

Attitude-to-action standards can be evaluated on a spectrum. For example, I do not fully respond to my belief-desire reasons for action; yet neither do I fully ignore them. I am an agent of middling rationality.

Attitude-to-attitude standards regulate consistency and implication regarding beliefs and intentions. Let’s begin with beliefs. Dennett writes:

What about the rationality one attributes to an intentional system? One starts with the ideal of perfect rationality and revises downward as circumstances dictate. That is, one starts with the assumption that people believe all the implications of their beliefs and believe no contradictory pairs of beliefs. (Dennett 1987, pg. 32)

According to Dennett, if a system’s behavior is best predicted by the hypothesis that it believes all of the implications of its beliefs, and believes no contradictions, then it is ideally rational. If its behavior is not best predicted by that hypothesis, then its location on the spectrum of rationality is a function of how many missed implications and contradictory beliefs are included in the hypothesis that best explains its behavior.

Attitude-to-attitude standards also regulate which intentions we should form. For example, Bratman identifies two rational pressures on the formation of intentions: consistency constraints, and means-end coherence. Bratman explains:

First, there are consistency constraints. To coordinate my activities over time a plan should be, other things equal, internally consistent. Roughly, it should be possible for my entire plan to be successfully executed. Further, a good coordinating plan is a plan for the world I find myself in. So, assuming my beliefs are consistent, such a plan should be consistent with my beliefs, other things equal. Roughly, it should be possible for my entire plan to be successfully executed given that my beliefs are true. [...]

Second, there is a demand for means-end coherence. Although plans are typically partial, they still must be appropriately filled in as time goes by. My plans need to be filled in with subplans concerning means, preliminary steps, and relatively specific course of action, subplans at least as extensive as I believe are now required to do what I plan. (Bratman 1987)

Finally, attitude-to-fact standards are satisfied when beliefs accurately represent the world. Clearly, this standard also admits a spectrum — even if it isn’t clear how to specify it. For example: how many true beliefs make up for a false belief?

What about fact-to-attitude? We might extend List and Pettit’s taxonomy to include one more category of standard: fact-to-attitude. This kind of standard evaluates how well the facts correspond to certain attitudes — in particular, desires. That is, fact-to-attitude standards can measure how effective an agent is at satisfying its desires.

Fact-to-attitude might be best interpreted as a meta-standard by which we can evaluate standards for practical rationality. It proposes an answer to the question: why these standards? — in particular, it proposes that rational standards should tend to help agents achieve their goals.

For example, Bratman justifies his consistency and means-end standards by an appeal to their usefulness:

“these demands are rooted in a pragmatic rationale: their satisfaction is normally required for plans to serve well their role in coordinating and controlling conduct.” (Bratman 1987)

Similarly, it’s plausible that having true, consistent beliefs, and acting on belief-desire reasons, helps agents in general achieve their goals. Bratman notes that standards need not be absolute. That is, they are “defeasible: there may be special circumstances in which it is rational of an agent to violate them.”

However, specifying the set of possible goals that rationality “in general” helps achieve is a difficult problem. It is one I will return to below with respect to instrumental convergence.

Representational complexity

The standards of rationality above are in a sense deontological — that is, they can be satisfied by avoiding certain violations. Therefore, even a very simple system — say, a thermostat — could be considered an ideal agent. If my thermostat correctly represents the temperature of my room and effectively regulates it, then it may leave nothing to criticize. I, on the other hand, surely hold many contradictory or untrue beliefs — and this is but one failing among my many violations of the standards of rationality. Is my thermostat, then, more of an agent than I am?

What standards of rationality don’t account for is that beliefs, desires, and intentions can be more or less complex. Let’s begin with beliefs. According to Dennett, the spectrum of complexity of beliefs is a motivation for instrumentalism. He writes that:

There is no magic moment in the transition from a simple thermostat to a system that really has an internal representation of the world around it. The thermostat has a minimally demanding representation of the world, fancier thermostats have more demanding representations of the world, fancier robots for helping around the house would have still more demanding representations of the world. Finally you reach us. (Dennett 1987, pg. 32)

The intentional strategy has some success in predicting the behavior of a thermostat. We could say that a thermostat has a belief about the temperature of a room, and a desire that it should be at a certain temperature — this would explain why the room heats and cools as it does. However, it is both possible and more enlightening to understand a thermostat as a mechanical system. For example, when the room overheats, the ‘mechanical’ strategy tells us that a wire is loose (or something), while the intentional strategy might leave us wondering why the thermostat has a new desired temperature.

According to Dennett, the key fact is that the intentional strategy does not give us extra predictive power over what we already get by understanding the thermostat as a mechanical system. For example, he considers whether a classroom lectern is an intentional system with, say, a desire to stay still and inert. Dennett writes:

What should disqualify the lectern? For one thing, the strategy does not recommend itself in this case, for we get no predictive power from it that we did not antecedently have. We already knew what the lectern was going to do—namely nothing [...]. (Dennett 1987, pg. 23)

However, as a system’s internal representations grow in complexity, at some point the intentional strategy becomes the most practical way to predict its behavior. It is at that point, Dennett might conclude, that the system becomes an agent.

But what about beyond that point? What happens as the agent’s internal representations grow increasingly complex? List and Pettit argue that increased complexity has two effects:

First, it exposes the agent to more ways of failing […]. Second, while an enhanced attitudinal scope may make it harder for an outside observer to identify the precise intentional states governing the agent’s behavior, it makes it harder still to explain that behavior on any other basis. It makes intentional explanation at once more difficult and less dispensable. (List and Pettit 2011, pg. 22)

More complex desires are more difficult to achieve, and every additional belief brings with it the possibility of contradiction and missed implications. Therefore, the more complex an agent’s internal representations, the more likely it is to violate standards of rationality. However, that does not make a more complex system less of an agent: the more complex an agent’s internal representations, the less predictive power any practical theory but the intentional strategy has.

Reasoning. We can also understand reasoning as enabled by a special case of representational complexity. Most existing agents — in fact, perhaps all existing agents other than humans — might only have beliefs about objects in their environment. In contrast, List and Pettit write that:

[…] since we human beings can form beliefs about propositions, not just about objects in the environment, we can ask questions about these more abstract entities, for example questions about their truth, their logical relations, or their evidential support. And we can do so out of a desire to maximize the prospect of having true beliefs, or consistent beliefs, or beliefs that are deductively closed or well supported. Not only can we rejoice in whatever rationality comes spontaneously to us from the impersonal processing we share with any agent. We can take intentional steps to reinforce our rationality. (List and Pettit 2011, pg. 30)

If improved rationality makes an agent more agentic, then reasoning is a process by which a system can make itself more agentic. I’ll return to this possibility in section 5.

Planning. Note that the belief-desire-intention model shares a common structure with Carlsmith’s definition of agency, where ‘desires’ correspond to ‘objectives’, ‘beliefs’ correspond to ‘models of the world’, and ‘intentions’ correspond to ‘plans’. Carlsmith is particularly concerned not only with intentional but also planning agents.

Bratman makes the connection between intentions and plans explicit:

We form future-directed intentions as parts of larger plans [...]. Intentions are, so to speak, the building blocks of such plans; and plans are intentions writ large. (Bratman 1987)

In other words, plans are complex intentions. However, that doesn’t imply that intentionality entails planning:

The latter capacity clearly requires the former; but it is plausible to suppose that the former could exist without the latter. Indeed, it is natural to see many nonhuman animals as having only the former capacity, and to see our possession of both capacities as a central feature of the sort of beings we are. (Bratman 1987)

Why, then, did humans develop the capacity to form plans? Recall that increased representational complexity opens agents to more ways of failing. According to Bratman, planning is a response to pressure on limited agents with complex internal representations:

The ability to settle in advance on such plans enables us to achieve complex goals we would not otherwise be able to achieve. This ability to settle on coordinating plans is a kind of universal means: it is of significant use in the pursuit of goals of very different sorts. (Bratman 1987)

Review. We have collected some theoretical tools with which to describe agency and the dimensions along which agents differ.

According to instrumentalism, agents can be understood as systems well-predicted by the intentional strategy. However, we should also expect them to have the internal states implied by realism about agency. Those internal states are representational states, motivational states, and intentional states, which correspond to belief, desire, and intention, respectively.

Agents can differ along the dimensions of rationality and complexity. Rationality consists of attitude-to-fact, attitude-to-action, and attitude-to-attitude standards. Agent complexity tracks the complexity of an agent’s internal states. Special cases include reasoning (which requires beliefs about propositions) and planning (which requires future-directed and hierarchical intentions).

4. Does agency generate power-seeking behavior?

Carlsmith’s threat model relies on the premise that agentic TAI will seek power by default. He argues that:

[...] we should expect, by default, to see incentives towards power-seeking reflected in the behavior of systems that engage in strategically aware agentic planning in pursuit of problematic objectives. However, this part of the overall argument is also one of my top candidates for ways that the abstractions employed might mislead.

In particular, it requires the agentic planning and strategic awareness at stake be robust enough to license predictions of the form: “if (a) a system would be planning in pursuit of problematic objectives in circumstance C, (b) power-seeking in C would promote its objectives, and (c) the models it uses in planning put it in a position to recognize this, then we should expect power-seeking in C by default.” (Carlsmith 2021, pg. 21)

In this section, I review arguments for expecting agency to generate power-seeking behavior.

Instrumental convergence

One reason to expect agentic TAI to seek power is the argument that a sufficiently rational and complex agent will seek power as an instrumental subgoal. This is the hypothesis of instrumental convergence.

The basic argument. Perhaps the most influential account of instrumental convergence is given by Nick Bostrom in his 2014 book, Superintelligence. Bostrom argues that:

Several instrumental values[6] can be identified which are convergent in the sense that their attainment would increase the chances of the agent’s goal being realized for a wide range of final goals and a wide range of situations, implying that these instrumental values are likely to be pursued by a broad spectrum of situated intelligent agents. (Bostrom 2014, pg. 109)

Bostrom identifies self-preservation, goal-preservation, cognitive enhancement, technological advancement, and resource acquisition as instrumentally-convergent goals. Carlsmith, however, argues that power itself is an instrumentally-convergent goal. Except perhaps goal-preservation, the goals Bostrom identifies can be understood as specific manifestations of seeking power.

Power. Carlsmith writes that, by power, he means something like “the type of thing that helps a wide variety of agents pursue a wide variety of objectives in a given environment.” (Carlsmith 2021, pg. 7) But this is just a recapitulation of the definition of an instrumentally-convergent goal.

Instead, we can say that power is a function of how many options are available to an agent.[7] For example, in chess, a queen is more ‘powerful’ than a bishop because it presents more available options for a next move. More generally, a player with more pieces remaining on the board is usually in a better position to win because they have more options each move. Or, to take another example, greater wealth is often associated with greater power: you usually have more options (of what to buy, where to go, whom to bribe) if you have more to spend.

That being said, power is not just a function of available options. First, it’s unclear how to quantify options, which can be more or less general. Second, some options are more important than others. Even if you have fewer pieces left, you can win a chess game if you have your opponent’s king cornered.

Like ‘agency’, it probably isn’t useful to exactly specify the referent of ‘power’. What matters is whether TAI will prevent humanity from controlling the parts of its future important to its flourishing.
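As a toy illustration of the options-counting intuition (my own sketch, not a formal definition from Carlsmith), one could proxy an agent’s power by the number of options, or reachable futures, its current state affords; `legal_moves` and `result` below are hypothetical stand-ins for whatever environment is being modeled.

```python
def mobility(state, legal_moves):
    """Crudest proxy for power: the number of options open to the agent right now."""
    return len(legal_moves(state))

def reachable_states(state, legal_moves, result, depth):
    """A slightly richer proxy: the set of distinct situations reachable within
    `depth` moves. 'Powerful' positions keep more futures open."""
    if depth == 0:
        return {state}
    reachable = {state}
    for move in legal_moves(state):
        reachable |= reachable_states(result(state, move), legal_moves, result, depth - 1)
    return reachable
```

As noted above, such counts ignore that some options matter far more than others (a cornered king outweighs any number of spare pawns), so they are at best a rough proxy.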

The Convergent Instrumental Value Thesis

In his forthcoming paper, Instrumental Divergence, Dmitri Gallow helpfully divides Bostrom’s thesis into two parts. The Convergent Instrumental Value Thesis concerns the existence of convergent instrumental goals, and the Instrumental Convergence Thesis the likelihood that a certain kind of agent (for Bostrom, a ‘superintelligent’ agent) would pursue those goals. The former is theoretical; the latter is predictive.

For his analysis, Gallow assumes that the agent in question is ideally rational — that is, if a goal is instrumentally rational (read: means-end rational), then it will pursue that goal. In that case, the Convergent Instrumental Value Thesis implies the Instrumental Convergence Thesis. However, the two theses come apart because an agent need not be ideally rational. More on that later.

The ‘strong’ Convergent Instrumental Value Thesis. One version of the Convergent Instrumental Value Thesis (which I’ll call the ‘strong’ version) is that some set of subgoals are instrumentally rational for all possible end goals. Gallow debunks this version by providing examples of goals for which the entries on Bostrom’s list are not instrumentally rational:

Suppose Sia’s only goal is to commit suicide, and she’s given the opportunity to kill herself straightaway. Then, it certainly won’t be rational for her to pursue self-preservation. Or suppose that Sia faces a repeated decision of whether to push one of two buttons in front of her. The one on the left changes her desires so that her only goal is to push the button on the right as many times as possible. The button on the right changes her desires so that her only goal is to push the button on the left as many times as possible. Right now, Sia’s only goal is to push the button on the left as many times as possible. Then, Sia has no instrumental reason to pursue goal-preservation. Changing her goals is the best means to achieving those goals. Suppose Sia’s only goal is to deliver you a quart of milk from the grocery store as soon as possible. To do this, there’s no need for her to enhance her own cognition, develop advanced technology, hoard resources, or re-purpose your atoms. And pursuing those means would be instrumentally irrational, since doing so would only keep you waiting longer for your milk. (Gallow pgs. 8-9)

Gallow does not assume that any of these goals would be particularly easy to specify for a superintelligent agent. For example, how would a designer exactly specify ‘time’ in a time-bound goal? You can’t point to time itself; you can only point to various clocks. If Sia’s goal is to deliver you a quart of milk in as few ticks of your watch as possible, then she might decide to destroy your watch straight away. Then, Sia is unstoppable.

Rather, Gallow’s point is that, in some contexts, there are possible goals for which CIVT does not hold. Therefore, ‘strong’ CIVT is false.

Eric Drexler makes a similar observation in his 2019 report, Reframing Superintelligence. He argues that, if true, the orthogonality of intelligence to goals undercuts strong CIVT:

If any level of intelligence can be applied to any goal, then superintelligent-level systems can pursue goals for which the pursuit of the classic instrumentally-convergent subgoals would offer no value. (Drexler 2019, pg. 98)

However, Drexler goes further than Gallow in suggesting that many non-problematic goals will be readily specifiable in superintelligent systems:

The AI-services model suggests that essentially all practical tasks are (or can be) directly and naturally bounded in scope and duration, while the orthogonality thesis suggests that superintelligent-level capabilities can be applied to such tasks.

The ‘weak’ Convergent Instrumental Value Thesis. A weaker version of the Convergent Instrumental Value Thesis is probabilistic: some subgoals are more likely than chance to be instrumentally rational.

Gallow’s analysis (which I won’t reconstruct here) finds that the weak Convergent Instrumental Value Thesis does hold for three kinds of goals. Again using ‘Sia’ as an example, he writes:

In the first place, she will be biased towards choices which leave less up to chance. In the second place, she will be biased towards desire preservation, confirming one of Bostrom’s conjectures. In the third place, she will be biased towards choices which afford her more choices later on. (As I’ll explain below, this is not the same thing as being biased towards choices which protect her survival, or involve the acquisition of resources or power—though they may overlap in particular decisions.) (Gallow 4)

He also clarifies that a bias towards certain choices only means that those choices are more likely than chance to be instrumentally rational. Exactly how much more likely is an open question — and power-seeking behavior might require power to be ‘highly’ instrumentally convergent. Therefore, CIVT may only be true in a very weak form. He concludes that:

Assuming we should think of a superintelligence like Sia as having randomly selected desires, the grains of truth may give us reasons to worry about machine superintelligence. But they do not on their own support the contention that the “default outcome of the creation of machine superintelligence is existential catastrophe”. (Gallow 4)
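The ‘weak’ thesis is easy to see in a toy Monte Carlo experiment (my own illustration, not Gallow’s model), which assumes that ‘randomly selected desires’ can be modeled as independent random utilities over terminal outcomes:

```python
import random

def option_preserving_wins(trials=100_000, options_a=3, options_b=1):
    """How often is the branch that keeps more terminal outcomes reachable
    (branch A) the rational choice for an agent with a random utility function?"""
    wins = 0
    for _ in range(trials):
        a_best = max(random.random() for _ in range(options_a))
        b_best = max(random.random() for _ in range(options_b))
        if a_best > b_best:
            wins += 1
    return wins / trials

if __name__ == "__main__":
    # With 3 vs. 1 reachable outcomes, branch A is optimal about 75% of the time:
    # better than chance, but far from the near-certainty that 'default doom'
    # arguments seem to require.
    print(option_preserving_wins())
```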

Would a superintelligent agent’s goals be well approximated as a random sampling across all possible goals?

Gallow’s analysis assumes that a superintelligent agent’s likely goals are well approximated by randomly sampling across all possible goals. He is responding to an argument for power-seeking risk that goes something like this:

  1. The goal that a superintelligent agent would be likely to have is approximated by random sampling across all possible goals.

  2. Set of goals S is instrumentally convergent with respect to all possible goals.

  3. A superintelligent agent would pursue instrumentally convergent goals.

  4. A superintelligent agent would pursue S. (1-3)

  5. S describes power-seeking behavior.

  6. A superintelligent agent would seek power. (4-5)

His analysis undermines this argument first by showing that the members of S are fewer than what Bostrom supposes, and therefore S might not describe power-seeking behavior as (5) claims. Second, he argues that S might only be weakly instrumentally convergent, such that (3) does not follow from (2): S may or may not be instrumentally rational for any particular goal.

Gallow assumes premise (1) for the sake of argument. However, we need not do the same. Premise (1) is implied by the orthogonality thesis (that superintelligence is compatible with any possible goal) in combination with the view that the designers of a superintelligent agent would have no control over its likely goals.

First, while the orthogonality thesis is generally taken for granted in AI risk discourse, it entails controversial metaethical commitments. For example, it entails normative antirealism, according to which normative propositions do not have (rationally discoverable) truth values. If normative realism were correct, then a superintelligent agent would be drawn to certain goals by the light of reason. I will not attempt to decide the question here — suffice it to say this is the point where many philosophers get off the AI risk bus.

Second, we might have reason to suspect the designers of a superintelligent system would have some control over its goals. Using the hypothetical superintelligent agent ‘Sia’ as an example, Gallow questions:

[…] the inference from the orthogonality thesis to the conclusion that Sia’s desires are unpredictable if not carefully designed. You might think that, while intelligence is compatible with a wide range of desires, if we train Sia for a particular task, she’s more likely to have a desire to perform that task than she is to have any of the myriad other possible desires out there in ‘mind design space’. (Gallow 3)

Gallow suggests that the goals of a superintelligent agent might be successfully guided by its designers, such that Bostrom’s or Carlsmith’s problematic goals are not likely to be instrumentally valuable for the set of goals it is likely to have.

Whether or not he’s right depends on the extent to which technical AI alignment succeeds. But technical AI alignment is not the only way a superintelligent agent’s goals might be influenced by its development and deployment. For example, we can consider the kinds of goals we expect AI agents to be designed to achieve, or the kinds of goals represented in the data on which they are trained.

It is possible that some subgoals will be instrumentally valuable across the set of goals a superintelligent agent is likely to have, as determined by the influence of its development and deployment. This possibility gives rise to another argument for power-seeking risk:

  1. The goal that a superintelligent agent is likely to have is not approximated by random sampling across all possible goals, but rather biased by the circumstances of its development and deployment.

  2. Set of goals S is instrumentally convergent with respect to the goals a superintelligent agent is likely to have.

  3. A superintelligent agent would pursue instrumentally convergent goals.

  4. A superintelligent agent would pursue S. (1-3)

  5. S describes power-seeking behavior.

  6. A superintelligent agent would seek power. (4-5)

Would power be instrumentally convergent across the goals a superintelligent agent is likely to have? Let’s begin by noticing that power is instrumentally valuable to many of the goals that humans tend to have. This is why resources like money and status are culturally valuable. What’s more, it seems likely that we would attempt to design superintelligent agents to pursue the kinds of goals we have. If a superintelligent agent’s goals are as a result biased in the direction of our kinds of goals, then power might be more likely to be instrumentally valuable than if they were randomly sampled from all possible goals. For example, none of the counterexamples Gallow gives to ‘strong’ CIVT is representative of a typical human goal.

Another reason to expect that superintelligent agents’ goals would be biased towards our kinds of goals might be due to the influence of training data. If superintelligent agents are based on the current deep learning paradigm, then they will likely be trained with data produced by humans. That data will overrepresent human goals.

We can also ask why power is instrumentally convergent across the goals that humans tend to have. Presumably, something in our evolutionary history or our cultural environment has selected for goals which reward power-seeking. Perhaps the development of superintelligent agents would involve similar pressures.

The Instrumental Convergence Thesis

Let’s review the ways in which the CIVT might fail to imply power-seeking behavior.

First, a superintelligent agent’s goals might not be well-represented by a random sample of all possible goals. This might be because 1) the orthogonality thesis is false, or 2) the agent’s designers have some control over its goals.

Second, power might not be a convergent instrumental subgoal (at least in the right way). According to Gallow, we have reason to believe that several of the problematic behaviors Bostrom identifies are not convergent instrumental subgoals. Gallow also argues that power is only a convergent instrumental subgoal in the general sense of choosing options which afford more options later on.

Third, power might only be “weakly” instrumentally convergent. That is, given a random goal, an action which preserves more options may be more likely than chance to be instrumentally valuable — but it need not be so likely as to predict problematic power-seeking behavior.

One response to these challenges is to treat the set of a superintelligent agent’s likely goals as significantly influenced by the circumstances of its training and development. Among this set, problematic power-seeking behavior might be more strongly instrumentally convergent.

However, there is another, more basic challenge: the CIVT does not necessarily imply that a superintelligent agent would in fact pursue instrumentally valuable subgoals.

This is a new argument against premise (3). Gallow suggested that (3) might not hold because an instrumentally convergent subgoal may or may not be instrumentally rational for any particular end goal. However, even if a subgoal is instrumentally valuable, a superintelligent agent may still fail to pursue it.

In Gallow’s terminology, this challenge reflects a distinction between the CIVT and the Instrumental Convergence Thesis (ICT).

Would a superintelligent agent be ideally rational?

The reason we might expect a superintelligent agent to pursue instrumentally valuable subgoals is that it would be rational to do so. That is, we might expect a superintelligent agent to be a rational agent that pursues instrumentally rational goals.

Recall that rationality is a continuous rather than a binary property. An agent is more rational to the extent that it satisfies certain norms. Those norms govern attitude-to-action, attitude-to-attitude, and attitude-to-fact relationships. Each of these norms plays a role in whether an agent would act on an instrumentally rational subgoal.

First, an agent might lack the right beliefs to recognize an instrumentally rational subgoal as such. It might be instrumentally rational for a chess-playing computer agent to manipulate its human opponent, but it won’t act on that fact if it doesn’t have the right beliefs (for example, about human psychology, or the existence of an external, non-chess world). Or, it might have various false beliefs.

Second, an agent might have inconsistent desires, beliefs, and intentions. My desire to stay fit and healthy points away from having a second helping of dessert. My desire to eat tasty food points towards it. It’s not with respect to some final end (e.g. to lead a fulfilling life) that I form a desire to eat sugary foods as a means. That desire presents a competing end in itself, whether I choose to act on it or not.

Supposing an agent has the right beliefs, we might say a subgoal is instrumentally rational for that agent if its belief-desire reasons support that option over alternatives. In general, then, we should expect superintelligent agents to pursue instrumentally rational subgoals. The relationship is almost definitional. A superintelligent agent is better than humans at achieving a wide range of goals, and instrumentally rational subgoals are the best means to achieving those goals.

But there are two agentic structures, extremes along a spectrum, that a superintelligent agent might instantiate. One structure describes an ideal agent, which pursues a single, final end (perhaps in the form of a utility function) and arrives at plans and intentions via reasoning. The other describes a collection of heuristics serving multiple, competing final ends. Following Carlsmith, we might label this the distinction between “clean” and “messy” agency.

One can imagine models whose cognition is in some sense cleanly factorable into a goal, on the one hand, and a goal-pursuing-engine, on the other (I’ll call this “clean” goal-directedness). But one can also imagine models whose goal-directedness is much messier—for example, models whose goal-directedness emerges from a tangled kludge of locally-activated heuristics, impulses, desires, and so on, in a manner that makes it much harder to draw lines between e.g. terminal goals, instrumental sub-goals, capabilities, and beliefs (I’ll call this “messy” goal-directedness). (Carlsmith 2023, pg. 57)

The point is that a “messy” superintelligent agent is possible. Intelligence and rationality come apart. Such an agent might exhibit superhuman performance within a certain set of options due to sophisticated heuristics, yet not take instrumentally rational options outside of that set — such as power-seeking behavior.
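To make the architectural contrast vivid, here is a minimal sketch of my own (not Carlsmith’s formalism; the class names, types, and decision rules are purely illustrative): a “clean” agent factors into an explicit utility function plus a generic option-ranking engine, while a “messy” agent is a bundle of locally activated heuristics with no single objective to read off.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

State = Dict[str, float]           # hypothetical world-state representation
Option = Callable[[State], State]  # an available action, mapping states to states

@dataclass
class CleanAgent:
    """'Clean' goal-directedness: one explicit goal plus a generic goal-pursuing engine."""
    utility: Callable[[State], float]

    def act(self, state: State, options: List[Option]) -> Option:
        # Rank every option by the single terminal goal and take the best.
        return max(options, key=lambda option: self.utility(option(state)))

@dataclass
class MessyAgent:
    """'Messy' goal-directedness: locally activated heuristics, no single objective."""
    heuristics: List[Callable[[State, List[Option]], Optional[Option]]]

    def act(self, state: State, options: List[Option]) -> Option:
        # Whichever heuristic fires first decides; nothing global is being optimized,
        # so there is no one goal to which power-seeking could be instrumental.
        for heuristic in self.heuristics:
            choice = heuristic(state, options)
            if choice is not None:
                return choice
        return options[0]  # arbitrary fallback when no heuristic applies
```

Only the first structure has something we could straightforwardly identify as the terminal goal for which power would be instrumentally useful; the second can still perform impressively on the situations its heuristics were tuned for.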

5. Pressures towards problematic agency

Gallow suggests that we might object to the argument for power-seeking risk much earlier than instrumental convergence: superintelligent AIs might not be agents at all, or at least not agents of the right kind. He writes:

You might wonder why an intelligent agent has to have desires at all. Why couldn’t Sia have an intellect without having any desires or motivations? Why couldn’t she play chess, compose emails, manage your finances, direct air traffic, calculate digits of 𝜋, and so on, without wanting to do any of those things, and without wanting to do anything else, either? (Gallow 3)

In light of the standard theory of agency, this objection appears somewhat confused. For a system to be an agent, it must contain representations which function as desires, beliefs, and intentions; a system without desires is simply not an agent. Even a chess-playing system is an agent, though in a minimal sense: its ‘desire’ is encoded in its evaluations of board states and can be understood as the goal of winning the game.

However, the kind of desire that Gallow is calling into question might be a ‘higher-order’ desire spanning the tasks he imagines Sia performing, something like “act as a competent assistant.” Instead, Sia might be a collection of smaller systems, each a minimal agent in its own right with a correspondingly limited desire.

Comprehensive AI Services

This model of superintelligence is perhaps most influentially explored by Eric Drexler in his 2019 report, Reframing Superintelligence: Comprehensive AI Services as General Intelligence. According to Drexler, the report:

[...] was prompted by the growing gap between models that equate advanced AI with powerful agents and the emerging reality of advanced AI as an expanding set of capabilities (here, “services”) in which agency is optional. (Drexler 2019, pg. 15)

Again, though, the agency Drexler has in mind here is the kind that might generate power-seeking risk, not minimal agency. His point is that superintelligence need not entail the kinds of agents Bostrom envisions. Instead, a set of (comprehensive) AI services, “[...] which includes the service of developing stable, task-oriented AI agents, subsumes the instrumental functionality of proposed self-transforming AGI agents.” (20)

According to Drexler, any individual agent need not generate power-seeking risks. A chess-playing agent, for example, lacks the right sort of complexity to generate power-seeking risks, such as beliefs about a non-chess world. Additionally, the desires of “task-oriented” agents need not be the kinds of desires for which power-seeking is instrumentally convergent. He uses a hypothetical language translation system as an example:

Language translation provides an example of a service best provided by superintelligent-level systems with broad world knowledge. Translation of written language maps input text to output text, a bounded, episodic, sequence-to-sequence task. [...]

There is little to be gained by modeling stable, episodic service-providers as rational agents that optimize a utility function over future states of the world, hence a range of concerns involving utility maximization (to say nothing of self-transformation) can be avoided across a range of tasks. Even superintelligent-level world knowledge and modeling capacity need not in itself lead to strategic behavior. (Drexler 2019, pg. 21)

The general argument is that there is no task which requires developing an agent with a problematic combination of desires and capabilities. In other words, all services can be performed by task-oriented AI agents.
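The contrast Drexler draws can be sketched as follows (an illustration of my own, not code from the report; the names, such as TranslationService and WorldStateOptimizer, are hypothetical): a bounded, episodic service is a stateless mapping from task input to task output, whereas the agent model that power-seeking arguments target optimizes a utility over future world states and carries state across episodes.

```python
from typing import Callable, Dict, List

# A CAIS-style 'service' (hypothetical type alias): bounded, episodic, stateless.
TranslationService = Callable[[str], str]

def run_service(service: TranslationService, documents: List[str]) -> List[str]:
    # Each call is a self-contained episode: no persistent memory, no world model,
    # no cross-episode objective for power to be instrumentally useful to.
    return [service(doc) for doc in documents]

class WorldStateOptimizer:
    """The contrasting picture: a persistent agent optimizing over future world states."""

    def __init__(self, utility_over_world_states: Callable[[Dict], float]):
        self.utility = utility_over_world_states
        self.memory: List[Dict] = []  # state carried across episodes

    def act(self, world_model: Dict, actions: List[Callable[[Dict], Dict]]):
        self.memory.append(world_model)
        # Chooses actions by their predicted effect on the whole future world state,
        # which is the structure that instrumental-convergence arguments target.
        return max(actions, key=lambda action: self.utility(action(world_model)))
```

On Drexler’s picture, the first structure gives instrumental-convergence arguments little to grip; it is the second that they target.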

Pressure towards (problematic) agency

If correct, Drexler’s argument shows that TAI can be developed in a way that doesn’t entail problematic kinds of agency (from here on, ‘agentic TAI’). However, it does not show that TAI will actually be developed according to the CAIS model, nor that the two options are functionally equivalent.

There may be structural pressures towards agentic TAI, despite the possibility of CAIS. For example, it may be easier to automate some politically or economically valuable tasks with a unitary agent. If the first actor to automate those tasks gains a significant first-mover advantage, then they might take the easier (and riskier) option.

Is rationality a convergently instrumental subgoal?

A key assumption in the case for power-seeking risks is that TAI systems can not only be modeled as agents, but as rational agents. Drexler argues that we need not build TAI as a rational agent (at least not the kind that might generate agentic risks). However, we might not be fully in control of the rationality of a TAI system — there might also be pressure towards rationality from the system itself.

Recall List and Pettit’s taxonomy of rational standards: attitude-to-fact, attitude-to-attitude, and attitude-to-action. I suggested that we might add a fourth standard: fact-to-attitude, which measures how well an agent satisfies its desires. Implicit in Bratman’s practical justification for his standards is the idea that rational standards should themselves be judged with respect to how well they help agents satisfy their desires.

In the language of instrumental convergence, then, rationality is a convergently-instrumental (CI) subgoal — and definitionally so. However, the particular rational standards proposed above may or may not be CI subgoals, and, as a consequence, may or may not be ‘true’ rational standards.

We should be clear about our definitions here. A natural definition of convergent instrumentality is that ideally rational agents with a wide variety of ends would tend to pursue CI subgoals. But, if ideally rational agents would tend to pursue those goals because they are convergently instrumental, then our definitions risk circularity.

There is also the technicality that, using this definition, rationality is not a CI subgoal. An ideally rational agent would not tend to seek greater rationality — by definition, it couldn’t. We should likely amend the definition of a CI subgoal to avoid invoking rationality. Perhaps we should just say that a CI subgoal tends to help a wide variety of agents achieve their respective ends.
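One way to write the amended definition down (a hedged formalization of my own; the notation and the threshold are not Gallow’s):

```latex
% A sketch of the amended definition, avoiding any appeal to ideally rational agents.
% All symbols are illustrative: D is a distribution over final goals g,
% A(g \mid s) is the agent's probability of attaining g if it pursues subgoal s,
% and \theta is a threshold capturing "a wide variety of goals".
\[
\mathrm{CI}(s) \;\iff\; \Pr_{g \sim D}\!\left[ A(g \mid s) > A(g \mid \lnot s) \right] \;\geq\; \theta
\]
% The "weak" vs. "strong" versions of the thesis then differ in how large the
% improvement must be and in which distribution D is used: all possible goals,
% or the goals superintelligent agents are likely to have.
```

On this reading, rationality plausibly qualifies: improving an agent’s attitude-to-fact and attitude-to-action performance tends to raise its chances of attaining most goals, and the definition never has to mention ideal rationality at all.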

As with other CI subgoals, not all rational agents will seek increased rationality. We can imagine a rational agent with no capacity to reflect on its own rationality. However, we can also imagine an imperfectly rational agent with such a capacity — namely, ourselves. And, as we sometimes do, such an agent might recognize it would be better able to achieve its goals if it were more rational.

Suppose that rationality is in fact a CI subgoal. It will therefore tend to be pursued by agents at a sufficient level of rationality. Let’s call that level ‘R’. Above R, if agents are able to improve their own rationality, then they will continue to do so in a positive feedback loop. Given the role of rationality in power-seeking risks, we might also call R the “point of critical rationality.”

Using Carlsmith’s language, a sufficiently ‘clean’ agent will become increasingly so as it approaches ideally rational agency. Suppose power-seeking risks emerge at a certain level of rationality, say, ‘P’. If P is ‘above’ R, then the point of no return is not when we develop agents above P, but rather when we develop agents above R.

That being said, a sufficient level of rationality is necessary but not sufficient for an agent to seek greater rationality. First, the agent needs the capacity to reflect on its own rationality — that is, it needs to contain complex internal representations of the kind we call “reasoning.” Second, it needs access to an effective means of improving its own rationality, which requires sufficient technical capability and freedom of action.
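A toy model can make the threshold picture concrete (purely illustrative; the function name, update rule, and numbers are assumptions of mine, not claims about real systems): below R an agent cannot improve its own rationality, while above R each increment of rationality enables the next.

```python
def rationality_trajectory(r0: float, R: float, steps: int = 20, gain: float = 0.3) -> list:
    """Toy model of a 'point of critical rationality' R (illustrative assumption).

    Below R, the agent lacks the reflective capacity to improve its own rationality,
    so r stays flat. Above R, each step of self-improvement closes part of the gap
    to ideal rationality at r = 1.0, enabling the next step.
    """
    r = r0
    trajectory = [r]
    for _ in range(steps):
        if r >= R:
            r += gain * (1.0 - r)
        trajectory.append(r)
    return trajectory

# An agent starting just below R never moves; one starting just above R
# converges towards ideal rationality in a positive feedback loop.
print(rationality_trajectory(r0=0.59, R=0.60)[-1])  # stays at 0.59
print(rationality_trajectory(r0=0.61, R=0.60)[-1])  # approaches 1.0 (~0.9997)
```

In this toy model, the interesting question is not whether an agent ever reaches P, but whether it ever crosses R while having the capacity and freedom to self-modify.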

6. Conclusion

In this section, I’ll conclude by reviewing the questions and scenarios I began with.

Questions

What does it mean for a system to be an agent?

Belief-desire-intention agency is the standard model for planning agents like us. Appropriately translated into the language of non-human agency, it provides a good first guess at what AI systems would require in order to be planning agents. Group agency provides a good precedent for abstracting beliefs, desires, and intentions away from specifically human mental states.

However, for a system to count as an agent, it must also meet some minimum level of rational standards. Together with complexity, rationality describes a spectrum along which agents can differ.

What about agency might generate power-seeking behavior?

The argument for power-seeking risks depends on two theses, which Gallow labels the Convergent Instrumental Value Thesis (CIVT) and the Instrumental Convergence Thesis (ICT).

CIVT says that power is instrumentally convergent across some relevant set of possible goals. That set might be all possible goals — but Gallow argues that CIVT is only weakly true across all possible goals. Alternatively, that set might be the goals superintelligent agents are likely to have. If those goals are biased towards the kinds of goals humans tend to have, across which power is instrumentally convergent, then CIVT might have a better case.

ICT assumes CIVT and adds that superintelligent agents will in fact pursue instrumentally convergent goals. Therefore, ICT is true if you assume superintelligent agents will be ideally rational — but that isn’t necessarily the case. Agents exist on a spectrum of rationality, between what Carlsmith calls “clean” and “messy” agency, and superintelligence doesn’t imply the former. In fact, we have reason to believe that as the complexity of a system’s internal representations increases, “clean” agency becomes increasingly difficult. Power-seeking behavior might require a particularly high level of rationality.

Should we expect TAI systems to be the kinds of agents that seek power?

Theory can help us clarify this question. In particular, we can ask: should we expect TAI systems to be planning agents with beliefs, desires, and intentions? Where will they fall on the spectrum of rationality? What kinds of goals should we expect them to have? However, these are decidedly empirical questions, and require empirical answers.

Drexler’s Comprehensive AI Services (CAIS) model predicts that, by default, we should not expect TAI systems to be the kinds of agents that seek power. Superintelligence will not be realized in single, generally intelligent agents, but rather by collections of systems designed to perform specific tasks, or ‘services.’ Even if these systems can be modeled as agents, they will not by default be sufficiently ‘clean’ (i.e. rational) to pursue instrumentally convergent goals outside of their domain.

On the other hand, we might predict either that 1) planning, rational agents are economically superior to systems of narrow services, or 2) AI labs will build them anyway. I also argued that rationality might itself be a convergently instrumental subgoal, which would imply a “point of critical rationality,” above which self-modifying systems would approach ideal rationality in a feedback loop.

Finally, I suggested we shouldn’t model the likely goals of superintelligent agents as randomly selected from among all possible goals. We will likely influence what those goals are, for better or for worse. Empirical work is necessary to determine whether power is instrumentally convergent among the goals superintelligent agents would be likely to have.

Scenarios

Finally, I’ll describe the scenarios we’re left with, as well as some plausible strategic implications. This section is speculative, and should be treated as motivating further work.

|                       | Power-seeking by default | Power-seeking not by default |
| --------------------- | ------------------------ | ---------------------------- |
| Agency by default     | Scenario 1               | Scenario 3                   |
| Agency not by default | Scenario 2               | Scenario 4                   |

Scenario 1: Don’t build TAI. In this scenario, agency is the default form of TAI. Agency is important or even necessary to perform some tasks at a superhuman level. However, superintelligent agents are also likely to exhibit power-seeking behavior. It might be that power is a convergently instrumental subgoal among the goals superintelligent agents are likely to have, and those agents are sufficiently rational and complex to pursue CI subgoals.

The default outcome of developing TAI in this scenario is an existential catastrophe. Accordingly, our high-level strategy should be to avoid building TAI.

Scenario 2: Don’t build agents. In this scenario, agency is not the default form of TAI. Instead, it could be that TAI is most naturally developed as a system of superintelligent ‘services,’ akin to Drexler’s CAIS model. Additionally, there are no extra incentives to develop agents.

However, were a superintelligent agent developed, it would be likely to exhibit power-seeking behavior. The world would face a unilateralist’s curse: it would only require the development of a single superintelligent agent to threaten existential catastrophe. Superintelligent services might enable effective defenses against superintelligent agents, and our high-level strategy should be to develop those defenses as quickly as possible.

Scenario 3: AIs as collaborators.

In this scenario, we should expect TAI to be agentic by default, but not to pursue power by default. Other AI threat models would take precedence. However, if agency is implicated in moral patienthood, then humanity might also have to orient itself towards AI agents as peers.

Scenario 4: AIs as tools.

In this scenario, TAI is neither agentic nor power-seeking by default. Drexler’s CAIS model is broadly correct, and AI risk is dominated by other, non-agentic threat models — such as misuse.

Acknowledgements: Thank you to Zershaaneh Qureshi, Christopher Dicarlo, Justin Bullock, Elliot McKernon, David Kristoffersson, and Alexa Pan for feedback.

  1. ^

    The formatting here means “conditional on premise (1).”

  2. ^

    “Pro-attitude” is a term philosophers use to mean “an attitude in favor of something.” The kind of pro-attitude that Davidson has in mind is desire, broadly speaking. It does not include intentions, even though Bratman classifies intentions as pro-attitudes.

  3. ^

    Example from Frankfurt’s paper, The Problem of Action

  4. ^

    See the chapter on action in Della Rocca’s The Parmenidean Ascent

  5. ^

    That is: well-predicted. Or, at least better-predicted than with alternative strategies.

  6. ^

    I’ll generally use “subgoals” rather than “values,” but I mean the same thing.

  7. ^

    This is something like the treatment given by Turner et al. in their paper, Optimal Policies Tend to Seek Power.
