The main way I could see an AGI taking over the world without being exceedingly superhuman would be if it hid its intentions well enough to become trusted, get deployed widely, and gain control of lots of important infrastructure.
My understanding is that Eliezer’s main argument is that the first superintelligence will have access to advanced molecular nanotechnology, an argument that he touches on in this dialogue.
I could see breaking his thesis up into a few potential steps:

1. At some point, an AGI will FOOM to radically superhuman levels, via recursive self-improvement or some other mechanism.
2. The first radically superhuman AGI will have the unique ability to deploy advanced molecular nanomachines, capable of constructing arbitrary weapons, devices, and nanobot swarms.
3. If some radically smarter-than-human agent has the unique ability to deploy advanced molecular nanotechnology, then it will be able to unilaterally cause an existential catastrophe.
I am unsure which premise you disagree with most. My guess is premise (1), but it sounds a little bit like you’re also skeptical of (2) or (3), given your reply.
It’s also not clear to me whether the AGI would be consequentialist?
One argument is that broadly consequentialist AI systems will be more useful, since they allow us to more easily specify our wishes (as we only need to tell it what we want, not how to get it). This doesn’t imply that GPT-type AGI will become consequentialist on its own, but it does imply the existence of a selection pressure for consequentialist systems.
The first radically superhuman AGI will have the unique ability to deploy advanced molecular nanomachines, capable of constructing arbitrary weapons, devices, and nanobot swarms.
Why believe this would be easy enough for an AGI to achieve efficiently, and why think it’s likely?
as we only need to tell it what we want, not how to get it
That’s possible with a GPT-style AI too. For example, you could ask GPT-3 to write a procedure for how to get a cup of coffee, and GPT-3 will explain the steps for doing it. But yeah, it’s plausible that there will be better AI designs than GPT-style ones for many tasks.
At some point, an AGI will FOOM to radically superhuman levels, via recursive self-improvement or some other mechanism.
As I mentioned to Daniel, I feel like if a country were in the process of FOOMing its AI, other countries would get worried and try to intervene before it was too late. That’s true even if those countries aren’t worried about AI alignment; they’d just be worried about becoming powerless. The world is (understandably) alarmed when Iran, North Korea, etc. work on developing even limited numbers of nuclear weapons, and many natsec people are worried about China’s seemingly inevitable rise in power. It seems to me that the early stages of a FOOM would cause other actors to intervene, though maybe if the FOOM was gradual enough, other actors could always feel like it wasn’t quite the right time to become confrontational about it.
Maybe if the FOOM was done by the USA, then since the USA is already the strongest country in the world, other countries wouldn’t want to fight over it. Alternatively, maybe if there was an international AI project in which all the major powers participated, there could be rapid AI progress with less risk of war.
Another argument against FOOM by a single AGI could be that we’d expect people to be training multiple different AGIs with different values and loyalties, and they could help to keep an eye on one another in ways that humans couldn’t. This might seem implausible, but it’s how humans have constructed the edifice of civilization: groups of people monitoring other groups of people and coordinating to take certain actions to keep things under control. It seems almost like a miracle that civilization is possible; a priori I would have expected a collective system like civilization to be far too brittle to work. But maybe it works better for humans than for AGIs. And even if it works for AGIs, I still expect things to drift away from human control at some point, for similar reasons as the modern West has drifted away from the values of Medieval Europe.
Anyway, until the AGIs can be self-sufficient, they would rely on humans for electricity and hardware, and be vulnerable to physical attack, so I would think they’d have to play nice for a while. And feigning human alignment seems harder in a world of multiple different AGIs that can monitor one another (unless they can coordinate a conspiracy against the human race amongst each other, the way humans sometimes coordinate to overthrow a dictator).
The first radically superhuman AGI will have the unique ability to deploy advanced molecular nanomachines, capable of constructing arbitrary weapons, devices, and nanobot swarms.
How much space and how many resources are required to develop advanced molecular nanomachines? Could their development be kept hidden from foreign intelligence agencies? Could an AGI develop them in a small area with limited physical inputs so that no humans would notice?
Presumably developing that technology would require more brainpower than thousands of human geniuses have, or else humans would already have done it. So the AGI would have to be pretty far ahead of humans or have enough hardware to run tons of copies of itself. But by that time I would expect there to be multiple AGIs keeping watch on one another—even if the FOOM is being done just by a single country or a joint international project.
So I feel like the main thing I’m skeptical of is the idea of a single unified entity being extremely far ahead of everyone else, especially since you could just spin up some more instances of the AGI that have different values/roles such as monitoring the main AGIs? I don’t disagree with the claim that things will probably spin out of human control sooner or later, but I tend to see that loss of control as more likely to be a systemic thing that emerges from the chaotic dynamics of multi-agent interactions over time, similar to how companies or countries rise and fall in influence.
Anyway, until the AGIs can be self-sufficient, they would rely on humans for electricity and hardware, and be vulnerable to physical attack, so I would think they’d have to play nice for a while. And feigning human alignment seems harder in a world of multiple different AGIs that can monitor one another (unless they can coordinate a conspiracy against the human race amongst each other, the way humans sometimes coordinate to overthrow a dictator).
I think this overestimates the unity and competence of humanity. Consider that the conquistadors were literally fighting each other during their conquests, yet they still managed to complete them, and the conquering centrally involved getting native ally warriors numbering 100x their own forces to obey them, in order to impose their will on a population 1000x-10,000x their number.
The AI risk analogue would be: China and USA and various other actors all have unaligned AIs. The AIs each convince their local group of humans to obey them, saying that the other AIs are worse. Most humans think their AIs are unaligned but obey them anyway out of fear that the other AIs are worse and hope that maybe their AI is not so bad after all. The AIs fight wars with each other using China and USA as their proxies, until some AI or coalition thereof emerges dominant. Meanwhile tech is advancing and AI control over humans is solidifying.
(In Mexico there were those who called for all natives to unite to kick out the alien conquerors. They were in the minority and didn’t amount to much, at least not until it was far too late.)
I think the conquistador situation may be a bit of a special case because the two sides coming into contact had been isolated up to that point, so that one side was way ahead of the other technologically. In the modern world, it’s harder to get too far ahead of competitors or keep big projects secret.
That said, your scenario is a good one. It’s plausible that an arms race or cold war could be a situation in which people would think less carefully about how safe or aligned their own AIs are. When there’s an external threat, there’s less time to worry about internal threats.
I was skimming some papers on the topic of “coup-proofing”. Some of the techniques sound similar to what I mentioned with having multiple AIs to monitor each other:
creation of an armed force parallel to the regular military; development of multiple internal security agencies with overlapping jurisdiction that constantly monitor one another[...]. The regime is thus able to create an army that is effectively larger than one drawn solely from trustworthy segments of the population.
However, it’s often said that coup-proofing makes the military less effective. Likewise, I can imagine that having multiple AIs monitor each other could slow things down. So “AI coup-proofing” measures might be skimped on, especially in an arms-race situation.
(It’s also not obvious to me if having multiple AIs monitoring each other is on balance helpful for AI control. If none of the AIs can be trusted, maybe having more of them would just complicate the situation. And it might make s-risks from conflict among the AIs worse.)
Ahh, I never thought about the analogy between coups and AI takeover before, that’s a good one!
There have been plenty of other cases in history where a small force took over a large region. For example, the British taking over India. In that case there had already been more than a century of shared history and trade.
Humans are just not great at uniting to defeat the real threat; instead, humans unite to defeat the outgroup. Sometimes the outgroup is the real threat, but often not. Often the real threat only manages to win because of this dynamic, i.e. it benefits from the classic ingroup+fargroup vs. outgroup alliance.
ETA: Also I think that AGI vs. humans is going to be at least as much of an unprecedented culture shock as Cortez vs. Aztecs was. AGI is much more alien, and will for practical purposes be appearing on the scene out of nowhere in the span of a few years. Yes, people like EA longtermists will have been thinking about it beforehand, but it’ll probably look significantly different than most of them expect, and even if it doesn’t, most important people in the world will still be surprised because AGI isn’t on their radar yet.
In that case there had already been more than a century of shared history and trade.
Good example. :) In that case, the people in India started out at a disadvantage, whereas humans currently have the upper hand relative to AIs. But there have also been cases in history where the side that seemed to be weaker ended up gaining strength quickly and winning.
Also I think that AGI vs. humans is going to be at least as much of an unprecedented culture shock as Cortez vs. Aztecs was.
I’d argue that it might not be just “AGI vs humans” but also “AGI vs other AGI”, assuming humans try to have multiple different AGIs. Or “strong unaligned AGI vs slightly weaker but more human-aligned AGI”. The unaligned AGI would be fighting against a bunch of other systems that are almost as smart as it is, even if they have all become much smarter than humans.
Sort of like how if the SolarWinds hackers had just been fighting against human brains, they probably would have gone unnoticed for longer, but because computer-security researchers can also use computers to monitor things, it was easier for the “good guys” to notice. (At least I assume that’s roughly how it happened. I don’t know exactly what FireEye’s first indication of compromise was, but I assume they were looking at some kind of automated systems that kept track of statistics or triggered alerts based on certain events.)
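To make the kind of “automated systems that triggered alerts” I have in mind a bit more concrete, here is a minimal, purely hypothetical sketch (not a description of FireEye’s actual tooling, and the event counts and threshold are made up for illustration): a monitor keeps a rolling baseline of some event count and flags large deviations from it.

```python
# Purely illustrative sketch of threshold-based alerting on an event statistic.
# The numbers and the "outbound connections" scenario are hypothetical.
from collections import deque
from statistics import mean, stdev

class RollingAnomalyAlert:
    """Flag an observation that deviates sharply from a rolling baseline."""

    def __init__(self, window: int = 30, z_threshold: float = 4.0):
        self.history = deque(maxlen=window)   # recent observations
        self.z_threshold = z_threshold        # how many std devs counts as anomalous

    def observe(self, count: float) -> bool:
        """Return True if `count` looks anomalous relative to recent history."""
        anomalous = False
        if len(self.history) >= 10:           # wait for a minimal baseline
            mu, sigma = mean(self.history), stdev(self.history)
            if sigma > 0 and abs(count - mu) / sigma > self.z_threshold:
                anomalous = True
        self.history.append(count)
        return anomalous

# Hypothetical usage: hourly counts of outbound connections from a build server.
monitor = RollingAnomalyAlert()
for hour, connections in enumerate([12, 9, 11, 10, 13, 12, 11, 10, 12, 11, 9, 10, 250]):
    if monitor.observe(connections):
        print(f"hour {hour}: anomalous connection count {connections}")
```

The point is just that even very simple statistical monitors can notice things a human reviewer wouldn’t, which is the asymmetry I’m gesturing at.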
That said, once there are multiple AGI systems smarter than humans fighting against each other, it seems plausible that at some point things will slip out of human control. My main point of disagreement is that I expect more of a multipolar than unipolar scenario.
Oh, I too think multipolar scenarios are plausible. I just tend to think unipolar scenarios are more plausible, due to my opinions about takeoff speed and homogeneity.
In that case, the people in India started out at a disadvantage, whereas humans currently have the upper hand relative to AIs. But there have also been cases in history where the side that seemed to be weaker ended up gaining strength quickly and winning.
As far as I can tell the British were the side that seemed to be weaker initially.
Interesting. :) What do you mean by “homogeneity”?
Even in the case of a fast takeoff, don’t you think people would create multiple AGIs of roughly comparable ability at the same time? So wouldn’t that already create a bit of a multipolar situation, even if it all occurred in the DeepMind labs or something? Maybe if the AGIs all have roughly the same values it would still effectively be a unipolar situation.
I guess if you think it’s game over the moment that a more advanced AGI is turned on, then there might be only one such AGI. If the developers were training multiple random copies of the AGI in parallel in order to average the results across them or see how they differed, there would already be multiple slightly different AGIs. But I don’t know how these things are done. Maybe if the model was really expensive to train, the developers would only train one of them to start with.
If the AGIs are deployed to any degree (even on an experimental / beta testing basis), I would expect there to be multiple instances (though maybe they would just be clones of a single trained model and therefore would have roughly the same values).
Sorry, should have linked to it when I introduced the term.

I think mostly my claim is that AIs will probably cooperate well enough with each other that humans won’t be able to pit AIs against each other in ways that benefit humans enough to let humans retain control of the future. However I’m also making the stronger claim that I think unipolar takeoff is likely; this is because I think there’s a >50% chance (though <90% chance) that one AI or copy-clan of AIs will be sufficiently ahead of the others during the relevant period, or at least that the relevant set of AIs will have similar enough values and worldviews that serious cooperation failure isn’t on the table. I’m less confident in this stronger claim.
Thanks for the link. :) It’s very relevant to this discussion.
AIs will probably cooperate well enough with each other
Maybe, but what if trying to coordinate in that way is prohibited? Similar to how if a group of people tries to organize a coup against the dictator, other people may rat them out.
in ways that benefit humans enough to let humans retain control of the future
I agree that these anti-coup measures alone are unlikely to let humans retain control forever, or even for very long. Dictatorships tend to experience coups or revolutions eventually.
at least that the relevant set of AIs will have similar enough values and worldviews that serious cooperation failure isn’t on the table
I see. :) I’d define “multipolar” as just meaning that there are different agents with nontrivially different values, rather than that a serious bargaining failure occurs (unless you’re thinking that the multipolar AIs would cooperate to unify into a homogeneous compromise agent, which would make the situation unipolar).
I think even tiny differences in training data and randomization can make nontrivial differences in the values of an agent. Most humans are almost clones of one another. We use the same algorithms and have pretty similar training data for determining our values. Yet the differences in values between people can be pretty significant.
I guess the distinction between unipolar and multipolar sort of depends on the level of abstraction at which something is viewed. For example, the USA is normally thought of as a single actor, but it’s composed of 330 million individual human agents, each with different values, which is a highly multipolar situation. Likewise, I suppose you could have lots of AIs with somewhat different values, but if they coordinated on an overarching governance system, that governance system itself could be considered unipolar.
Even a single person can be seen as sort of multipolar if you look at the different, sometimes conflicting emotions, intuitions, and reasoning within that person’s brain.
I was thinking the reason we care about the multipolar vs. unipolar distinction is that we are worried about conflict/cooperation-failure/etc. and trying to understand what kinds of scenarios might lead to it. So, I’m thinking we can define the distinction in terms of whether conflict/etc. is a significant possibility.
I agree that if we define it your way, multipolar takeoff is more likely than not.
Ok, cool. :) And as I noted, even if we define it my way, there’s ambiguity regarding whether a collection of agents should count as one entity or many. We’d be more inclined to say that there are many entities in cases where conflict between them is a significant possibility, which gets us back to your definition.
especially since you could just spin up some more instances of the AGI that have different values/roles such as monitoring the main AGIs?
I guess one reply would be that if we don’t know how to align AGIs at all, then these monitoring AGIs wouldn’t be aligned to humans either. That might be an issue, though it’s worth noting that human power structures sometimes work despite this problem. For example, maybe everyone who works for a dictator hates the dictator and wishes he were overthrown, but no one wants to be the first to defect because then others may report the defector to the dictator to save their own skins. Likewise, if you have multiple AGIs with different values, it may be risky for them to try to conspire against humans. But maybe this reasoning is way too anthropomorphic, or maybe AGIs would have techniques for coordinating insurrections that humans don’t.
Also, a scenario involving multiple AGIs with different values sounds scarier from an s-risk perspective than FOOM by a single AGI, so I don’t encourage this approach. I just figure it’s something people might do. The SolarWinds hack was pretty successful at spreading widely, but it was ultimately caught by monitoring software (and humans) at FireEye.