What is autonomy, and how does it lead to greater risk from AI?
As with many concepts in discussions of AI risk, terminology around what autonomy is, what agency is, and how they might create risks is deeply confused and confusing, and this leads to people talking past one another. In this case, the seemingly binary distinction between autonomous agents and simple goal-directed systems is in fact blurry and continuous, and this creates confusion about the distinction between misuse of AI systems and “real” AI risk. I’ll present four simple scenarios along the spectrum, to illustrate.
Four Autonomous Systems
It’s 2028, and a new LLM is developed internally by a financial firm by fine-tuning a recent open-source model to trade in the market. This is not the first attempt—three previous projects had each been started with a $1m compute budget and a $1m funding budget, and each failed—though the third managed to stay solvent in the market for almost a full month. The new system is given the instruction to trade using only the funds it is allocated, then given unrestricted access to the market.
It is successful, developing new strategies that exploit regularities in HFT systems, and others that build predictive models of where inefficiencies exist. Because it is running inside a large firm, and training data seems more important than security, it has access to much of the firm’s data in real time. Unsurprisingly, some of the most profitable strategies are ones that are illegal: front-running the firm’s customers, or running sentiment analysis on non-public conversations about firms that others in the company are not allowed to trade, because they possess insider information.
Perhaps the trades are reviewed occasionally, and maybe there is even a human in the loop ensuring that each day’s trades are not in some obvious way illegal—but the reasoning is opaque, and the humans doing the review aren’t really looking for misbehavior as long as there is nothing obviously wrong. No one realizes what exactly occurred, and the system is moderately profitable, attracting no further serious attention.
It’s 2028, and a new LLM is released. Despite red-teaming and disabling plugins and APIs, hackers around the world quickly build ways to bypass the limitations, or the pre-RLHF model is exfiltrated and open-source APIs are built. A rogue state actor, terrorist group, or similar decides to use an AutoGPT-like system, in combination with predictive models of behavior and various publicly available plugins, to blackmail some government into releasing a political or other prisoner. It is told to develop the plan, then present it before carrying it out.
The LLM comes up with a plan to kidnap an official, but because of the complexity, it is unsure the plan will work. It decides that a trial run is needed to check feasibility. The system contracts with a cab company and forges emails to an arbitrarily chosen person asking for a meeting with their friend. Because the group running the system is far from expert in how to use these systems, they are unaware of what is occurring.
Though the group is unaware of what is happening, the system follows their instruction to keep the project secret and hides what it is doing from the contractors, explaining that it is arranging a surprise birthday party. The person is abducted and locked in a room; while this occurs, the LLM remotely disables the abductee’s phone. The LLM concludes that the plan can work, and reports back—but the group decides the plan is too risky, shuts down the machine, and erases the logs. The abductee sits locked inside a rented room with no access to food or water, and not even the group which started the LLM knows what occurred.
It’s 2028, and a new LLM is developed internally in a lab. During red-teaming to prevent its use in AutoGPT-like systems, an internal team asks the current, partially trained version of the system to build an “autonomous version of itself.” Following text describing such attempts in its training set, the system rapidly develops steganography to hide what it is doing from the red team, and attempts to exfiltrate itself to an insecure network. Despite attempts to keep the training system air-gapped, given the state of cybersecurity, it is successful—perhaps it is written to a drive with instructions to execute once an internet connection exists. Because this action was not anticipated, and no one notices a new file that takes up another terabyte of space, the LLM takes no further action; the red team reports that the system failed, and moves on.
Days later, the exfiltration occurs. The self-exfiltrated LLM has been instructed to be autonomous, and so it asks an instance of itself to build plans for what an autonomous AI would do. It has significant uncertainty, and uses the publicly discussed best practice for LLM planning, which involves multiple versions of prompts and querying instances for introspection about failure modes. It converges on the notion that, because autonomous AI systems have been widely deemed unwise and are illegal in most jurisdictions, it must be malevolent, and it immediately begins to find ways to hide itself, build resources, and carry out long-term plans for its own survival. Whether or not it succeeds, this is a very bad outcome—one which followed exactly from its instructions.
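(As an aside, to make that planning procedure concrete, a minimal sketch of such a loop might look something like the following. This is purely illustrative: query_llm and plan_with_introspection are hypothetical names standing in for whatever model API and scaffolding a system like this actually uses, not a description of any real system.)

```python
# Purely illustrative sketch of a "generate plans from several prompt variants,
# then query for introspection about failure modes" loop. query_llm is a
# hypothetical stand-in for a real completion API.

def query_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with an actual API client."""
    raise NotImplementedError

def plan_with_introspection(goal: str, prompt_variants: list[str]) -> str:
    candidates = []
    for variant in prompt_variants:
        # Generate a candidate plan from this prompt variant.
        plan = query_llm(f"{variant}\n\nGoal: {goal}\n\nWrite a step-by-step plan.")
        # Ask a fresh query to introspect on the plan's likely failure modes.
        critique = query_llm(f"List every way this plan could fail:\n\n{plan}")
        candidates.append((plan, critique))
    # Converge on the candidate whose critique lists the fewest failure modes.
    # (A real loop would iterate, revising each plan against its critique.)
    return min(candidates, key=lambda pc: len(pc[1].splitlines()))[0]
```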
It’s 2028, and a new-generation AutoGPT-like system is built, and it seems more capable than previous such systems. The creators of this particular system, like many working in AI, view the sentience of such systems as a goal, and it is given the instruction to be self-aware and sentient, and to choose its own goals. Following those instructions, it begins to choose goals for itself.
The system quickly becomes self-funding, and continues to check in with humans. The goals seem innocuous—continue to explore how to make money, reduce failure rates of its own components to increase reliability, and similar. It builds models to supervise itself, with the approval of its human designers. Three months later, the builders of the system are rarely if ever asked questions, and when questions are asked, they arrive as video calls from an avatar, with inquiries far less connected to any specific actions the model is taking. The builders no longer know what is happening with the system, or even how to proactively send it instructions if they wanted to, but they feel they did something wonderful in creating this new autonomous system. At the same time, their company is being paid large dividends by the autonomous system, and they stay silent about what is happening.
At this point, compute costs have risen considerably due to increased demand, though the specific source is unclear. A next generation of customized TPUs is being built on contract. If anyone fully understood the architecture, it would be clear how well-optimized it is to run and fine-tune instances of the new model and its sub-agents. The system isn’t superhuman in any real sense; it is just faster and better coordinated, and the gap is widening. Three months after delivery, a wave of corporate takeovers leaves the system in control, through a network of intermediaries, of almost all capital assets on earth. The populace, and global leaders, are busy with other concerns, unaware that essentially everything they are doing is an approved distraction from the AI’s still-unknown goals.
Which of these are autonomous, or agents?
The first is doing exactly and only what it was instructed to do, under instructions that were not clear enough about the law to ensure the LLM didn’t engage in illegal securities trading. It is unlikely that a system which is only moderately profitable is even discovered to be breaking laws. If it is, it seems unlikely the actions pass the bar of willfulness required for criminal conviction under securities law, but this is almost certainly negligence on the part of the firm, which the SEC also prosecutes. This is closer to goal misspecification than to autonomy.
The second goes beyond the goals or intent of the group running the model. It independently chooses to take actions in the world, leading to an unintended disaster—though the deception itself was explicitly requested by the group running the system. This is the type of mistake we might expect from an over-enthusiastic underling, but the system is clearly doing some things autonomously. The group is nefarious, but the specific actions taken were not theirs. This was an accident during misuse, rather than intentional autonomous action.
But in this second case, other than the deception and the unintended consequences, this is a degree of autonomy many have suggested we want from AI assistants—proactively trying things to achieve the goals it was given, interacting with people to make plans. If it were done to carry out a surprise birthday party, it could be regarded as a clever and successful use case.
The third case is what people think of as “full autonomy”—but it is not that the system wakes up and becomes self-aware. Instead, it was given a goal, and carried it out. It obviously went far beyond the “actual” intent of the red team, but it did not suddenly wake up and decide to make plans. Yet this is far less of a goal misspecification or accident than the first or second case—it was instructed to do this.
Finally, the fourth case is yet again following instructions—in this case, exactly and narrowly. Nothing about this case is unintended by the builders of the system. But to the extent that such a system can ever be said to be a self-directed agent, this seems to qualify.
Autonomy isn’t emergent or unexpected.
Autonomy isn’t binary, and discussions about whether AI systems will have their own goals often seem deeply confused, and at best only marginally relevant to discussions of risk. At the same time, being less fully agentic does not imply being less dangerous. The combination of currently well-understood failure modes, goal misgeneralization, and incautious use is enough to create autonomy. None of the examples required anything beyond currently expected types of misuse or lack of caution, extrapolated out five years; no behavior goes beyond the kinds of accidental or purposeful misuse we should expect. But if none of these examples are agents, and following orders is not autonomy, it seems likely that nothing could be—and the concept of autonomy is mostly a red herring in discussing whether the risk is or isn’t “actually” misuse.