Here is an argument for how GPT-X might lead to proto-AGI in a more concrete, human-aided way:
...language modelling has one crucial difference from Chess or Go or image classification. Natural language essentially encodes information about the world (the entire world, not just the world of the Goban) in a much more expressive way than any other modality ever could.[1] By harnessing the world model embedded in the language model, it may be possible to build a proto-AGI.
...
This is more a thought experiment than something that’s actually going to happen tomorrow; GPT-3 today just isn’t good enough at world modelling. Also, this method depends heavily on at least one major assumption—that bigger future models will have much better world modelling capabilities—and a bunch of other smaller implicit assumptions. However, this might be the closest thing we ever get to a chance to sound the fire alarm for AGI: there’s now a concrete path to proto-AGI that has a non-negligible chance of working.
It explains how a GPT-X could become an AGI (via world modelling). I think then things like the basic drives would take over. However, maybe it’s not the end-result model that we should be looking at as dangerous, but rather the training process? An ML-based (proto-)AGI could do all sorts of dangerous (consequentialist, basic-AI-drives-y) things whilst trying to optimise for performance in training.
I think then things like the basic drives would take over.
Where would those basic drives come from (apart from during training)? An already trained GPT-3 model tries to output text that looks human-like, so we might imagine that a GPT-X AGI would also try to behave in ways that look human-like, and most humans aren’t very consequentialist. Humans do try to preserve themselves against harm or death, but not in an “I need to take over the world to ensure I’m not killed” kind of way.
If your concern is about optimization during training, that makes sense, though I’m confused as to whether it’s dangerous if the AI only updates its weights via a human-specified gradient-descent process, and the AI’s “personality” doesn’t care about how accurate its output is.
Yes, my concern is optimisation during training. My intuition is along the lines of “sufficiently large pile of linear algebra with reward function -> basic AI drives maximise reward -> reverse engineers [human behaviour / protein folding / etc.] and manipulates the world so as to maximise its reward -> [foom / doom]”.
I wouldn’t say “personality” comes into it. In the above scenario the giant pile of linear algebra is completely unconscious and lacks self-awareness; it’s more akin to a force of nature, a blind optimisation process.
Thanks. :) Regarding the AGI’s “personality”, what I meant was what the AGI itself wants to do, if we imagine it to be like a person, rather than what the training that produced it was optimizing for. If we think of the gradient descent used to train the AGI as being like evolution, and the AGI at some step of training as being like a particular human in humanity’s evolution, then while evolution itself is optimizing something, the individual human is just an adaptation executor and doesn’t directly care about his inclusive fitness. He just responds to his environment as he was programmed to do. Likewise, the GPT-X agent may not really care about trying to reduce training errors by modifying its network weights; it just responds to its inputs in human-ish ways.
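To make the distinction between "what the training process optimizes" and "what the trained thing does" a bit more concrete, here is a toy sketch. It is only an illustration under strong assumptions: a one-parameter model, plain gradient descent on squared error, and made-up names (`predict`, `loss`, `train_step`) that come from no real system. It says nothing about how a large model would actually behave; it just shows where the optimization lives in the code.

```python
# Toy illustration only: a one-parameter "model" trained by gradient descent.
# All names (predict, loss, train_step) are hypothetical, chosen for this sketch.

def predict(w, x):
    """The 'adaptation executor': maps an input to an output.
    It never sees the loss and has no code path that edits w."""
    return w * x

def loss(w, data):
    """Mean squared error, computed by the training process, not by the model."""
    return sum((predict(w, x) - y) ** 2 for x, y in data) / len(data)

def train_step(w, data, lr=0.05):
    """Human-specified update rule: one step of gradient descent on the loss."""
    grad = sum(2 * (predict(w, x) - y) * x for x, y in data) / len(data)
    return w - lr * grad

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # toy targets following y = 2x
w = 0.0
initial_loss = loss(w, data)
for _ in range(200):
    w = train_step(w, data)  # the outer loop is what reduces the loss
final_loss = loss(w, data)
```

In this sketch, the loss goes down only because the outer loop calls `train_step`; `predict` just responds to its input. Whether that separation still holds in any meaningful sense for a very large model is exactly the open question in this thread.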
Crossposted from here
Thanks. :) Is that just a general note along the lines of what I was saying, or does it explain how a GPT-X AGI would become consequentialist?