I think that, at that point, things like the basic AI drives would take over.
Where would those basic drives come from (apart from during training)? An already trained GPT-3 model tries to output text that looks human-like, so we might imagine that a GPT-X AGI would also try to behave in ways that look human-like, and most humans aren’t very consequentialist. Humans do try to preserve themselves against harm or death, but not in an “I need to take over the world to ensure I’m not killed” kind of way.
If your concern is about optimization during training, that makes sense, though I’m confused as to whether it’s dangerous if the AI only updates its weights via a human-specified gradient-descent process, and the AI’s “personality” doesn’t care about how accurate its output is.
Yes, my concern is optimisation during training. My intuition is along the lines of “sufficiently large pile of linear algebra with reward function -> basic AI drives maximise reward -> reverse engineers [human behaviour / protein folding / etc] and manipulates the world so as to maximise its reward -> [foom / doom]”.
I wouldn’t say “personality” comes into it. In the above scenario the giant pile of linear algebra is completely unconscious and lacks self-awareness; it’s more akin to a force of nature, a blind optimisation process.
Thanks. :) Regarding the AGI’s “personality”, what I meant was what the AGI itself wants to do, if we imagine it to be like a person, rather than what the training that produced it was optimizing for. If we think of gradient descent to train the AGI as like evolution and the AGI at some step of training as like a particular human in humanity’s evolution, then while evolution itself is optimizing something, the individual human is just an adaptation executor and doesn’t directly care about his inclusive fitness. He just responds to his environment as he was programmed to do. Likewise, the GPT-X agent may not really care about trying to reduce training errors by modifying its network weights; it just responds to its inputs in human-ish ways.
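To make the outer-optimizer vs. adaptation-executor distinction concrete, here is a minimal sketch, assuming a toy PyTorch-style setup (the tiny linear model, shapes, and loss are purely illustrative stand-ins, not anything from the discussion above). The point is only structural: the training loop is the “human-specified gradient-descent process” that pushes error down, while the trained model at deployment just executes its learned input-to-output mapping with no access to, or interest in, that objective.

```python
import torch
import torch.nn as nn

# Toy stand-in for "GPT-X": any parametric next-token predictor.
model = nn.Linear(16, 16)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Outer loop: the human-specified gradient-descent process.
# *This* is the part of the system that "cares" about reducing training error.
for step in range(100):
    x = torch.randn(8, 16)        # toy inputs
    target = torch.randn(8, 16)   # toy targets
    loss = loss_fn(model(x), target)
    optimizer.zero_grad()
    loss.backward()               # gradients computed against the objective
    optimizer.step()              # weights nudged by the outer process

# Deployment: the frozen model just maps inputs to outputs.
# No loss, no gradients, no weight updates -- the "adaptation executor"
# has no handle on the objective that shaped it.
with torch.no_grad():
    output = model(torch.randn(1, 16))
```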