In terms of goal directedness, I think a lot of the danger hinges on whether and what kinds of internal models of the world will emerge in different systems, and on not knowing what those will look like. Many capabilities people didn’t necessarily expect or foresee suddenly emerged after more training—for example, the jumps in abilities from GPT to GPT-2 and GPT-3. A similar jump to the emergence of internal models of the world may happen at another threshold.
I think I would feel better if we had some way of concretely and robustly specifying “goal directedness that doesn’t go out of control” for the kinds of models currently being trained. Or at least a robust account of how these systems currently work, something like: “current models achieve xyz abilities by manipulating data in this way, which will never produce that testable ability in the next 1-3 years but will likely produce these abilities in that timeframe.”
In terms of an AI vs. all human intelligence combined: even assuming all humans combined are more intelligent than an AGI, and that this AGI is for whatever reason not able to drastically self-improve, it could still make copies of itself, and each copy could think thousands of times faster than any person. Current trends show it takes far more compute/resources to train a model than to run it, so an AGI that copies itself thousands or millions of times, with each copy modifying itself to get better and better at specific tasks, would still be really dangerous if their goals are misaligned. As far as I can tell, it would be pretty easy for a group of AGIs all smarter than the smartest humans to hack into and take over any systems they needed through the internet, and to acquire resources by deceiving people when that alone isn’t sufficient. And given that they would all be created with the same goals, their coordination would likely be much better than humans’.