especially since you could just spin up some more instances of the AGI that have different values/roles such as monitoring the main AGIs?
I guess one reply would be that if we don’t know how to align AGIs at all, then these monitoring AGIs wouldn’t be aligned to humans either. That might be an issue, though it’s worth noting that human power structures sometimes work despite this problem. For example, maybe everyone who works for a dictator hates the dictator and wishes he were overthrown, but no one wants to be the first to defect because then others may report the defector to the dictator to save their own skins. Likewise, if you have multiple AGIs with different values, it may be risky for them to try to conspire against humans. But maybe this reasoning is way too anthropomorphic, or maybe AGIs would have techniques for coordinating insurrections that humans don’t.
Also, a scenario involving multiple AGIs with different values sounds scarier from an s-risk perspective than FOOM by a single AGI, so I don’t encourage this approach; I just figure it’s something people might do. As a loose analogy for the monitoring idea: the SolarWinds hack was pretty successful at spreading widely, but it was ultimately caught by monitoring software (and humans) at FireEye.