Sorry if this isn’t as polished as I’d hoped. Still a lot to read and think about, but posting as I won’t have time now to elaborate further before the weekend. Thanks for doing the AMA!
It seems like a crux you have identified is how “sudden emergence” happens. How would a recursive self-improvement feedback loop start? Increasing optimisation capacity is a convergent instrumental goal, but how exactly is that goal reached? To give the most pertinent example: what would the nuts and bolts of it be in an ML system? It’s possible to imagine a sufficiently large pile of linear algebra enabling recursive chain reactions of improvement in both algorithmic efficiency and size (e.g. capturing all global compute → nanotech → converting Earth to Computronium), even more so since GPT-3. But what would the trigger be for setting it off?
Does the above summary of my take on this chime with yours? Do you (or anyone else reading) know of any attempts to articulate such a “nuts-and-bolts” explanation of the “sudden emergence” of AGI in an ML system?
Or maybe there would be no trigger? Maybe a great many arbitrary goals would lead sufficiently large ML systems to brute-force stumble upon recursive self-improvement as an instrumental goal (or via mesa-optimisation)?
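To make the “nuts and bolts” question concrete, here is a deliberately toy sketch of the kind of feedback loop I mean. Nothing in it is a real mechanism and every number is made up; the one assumption doing all the work is the line coupling search quality to current capability, which is exactly the coupling I’m asking how an ML system would acquire (and what would switch it on):

```python
import random

# Toy model only: a system whose current capability determines how well it
# can search for improved versions of itself.

def improve(capability, search_quality, n_candidates=10):
    # Try n_candidates self-modifications and keep the best gain found.
    best_gain = max(random.random() * search_quality for _ in range(n_candidates))
    return capability * (1 + best_gain)

capability = 1.0
for step in range(20):
    # The assumed coupling: more capability -> better search for improvements.
    # Make search_quality a constant instead, and growth saturates rather
    # than chain-reacting.
    search_quality = 0.1 * capability
    capability = improve(capability, search_quality)
    print(f"step {step:2d}: capability = {capability:.3g}")
```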
Responding to some quotes from the 80,000 Hours podcast:
“It’s not really that surprising that I don’t have this wild destructive preference about how they’re arranged, let’s say the atoms in this room. The general principle here is that if you want to try and predict what some future technology will look like, maybe there is some predictive power you get from thinking about ‘X percent of the ways of doing this involve property P’. But it’s important to think about the process by which this technology or artifact will emerge. Is that the sort of process that will be differentially attracted to things which are, let’s say, benign? If so, then maybe that outweighs the fact that most possible designs are not benign.”
What mechanism would make an AI attracted to benign things? Surely only human direction? But to my mind the whole Bostrom/Yudkowsky argument is that it FOOMs out of the control of humans (and e.g. converts everything into Computronium as a convergent instrumental goal).
“There’s some intuition of just: the gap between something that’s going around and, let’s say, murdering people and using their atoms for engineering projects, and something that’s doing whatever it is you want it to be doing, seems relatively large.”
This reads like a bit of a strawman. My intuition for the problem of instrumental convergence is that in many take-off scenarios the AI will seek to perform (a lot) more computation, and the way it will do this is by converting all available matter into Computronium (with human-existential collateral damage). From what I’ve read, you don’t directly touch on such scenarios; I’d be interested to hear your thoughts on them.
“my impression is that you typically won’t get behaviours which are radically different or that seem like the system’s going for something completely different.”
Whilst you might not typically get radically different behaviours, in the cases where ML systems do fail, they tend to fail catastrophically (in ways that a human never would)! This also fits with the notion that hidden proxy goals from “mesa-optimisers” are a major concern (alongside accurate and sufficient specification of human goals).
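Here is a toy illustration of that failure mode (my own construction, using plain least squares rather than a deep network, but the mechanism carries over): the learner puts all its weight on a proxy feature that is perfectly predictive in training, then drops to chance the moment the proxy decouples from the true goal.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Training: the causally relevant feature is noisy, while the proxy happens
# to equal the label exactly -- so the learner never needs the true feature.
y_train = rng.choice([-1.0, 1.0], size=n)
x_true = y_train + rng.normal(size=n)   # ~84% accurate on its own
x_proxy = y_train.copy()                # spurious, but 100% accurate here
w, *_ = np.linalg.lstsq(np.column_stack([x_true, x_proxy]), y_train, rcond=None)
print("learned weights [true, proxy]:", w)   # essentially [0, 1]

# Deployment: the proxy decouples from the label (distribution shift).
y_test = rng.choice([-1.0, 1.0], size=n)
X_test = np.column_stack([y_test + rng.normal(size=n),
                          rng.choice([-1.0, 1.0], size=n)])
print("accuracy after the proxy breaks:", (np.sign(X_test @ w) == y_test).mean())
```

A human relying on the meaningful feature would still score around 84% here; the learned model scores a coin flip, and with an anti-correlated proxy it would do worse than chance.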
Here is an argument for how GPT-X might lead to proto-AGI in a more concrete, human-aided way:
...language modelling has one crucial difference from Chess or Go or image classification. Natural language essentially encodes information about the world—the entire world, not just the world of the Goban, in a much more expressive way than any other modality ever could.[1] By harnessing the world model embedded in the language model, it may be possible to build a proto-AGI.
...
This is more a thought experiment than something that’s actually going to happen tomorrow; GPT-3 today just isn’t good enough at world modelling. Also, this method depends heavily on at least one major assumption—that bigger future models will have much better world modelling capabilities—and a bunch of other smaller implicit assumptions. However, this might be the closest thing we ever get to a chance to sound the fire alarm for AGI: there’s now a concrete path to proto-AGI that has a non-negligible chance of working.
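To make the quoted proposal concrete, here is a minimal sketch of what “harnessing the world model embedded in the language model” might look like: an outer loop of ordinary code that only ever asks the LM “what happens next?” and “how good is that?”. It assumes the 2020-era pre-1.0 `openai` completion API; the goal string and prompts are invented for illustration, so treat it as a thought experiment in code, not a working agent.

```python
import openai  # assumes the pre-1.0 openai package, with an API key already set

GOAL = "keep the data centre's cooling system running"  # invented example goal

def complete(prompt):
    # 2020-era completion call; substitute whatever language model you have.
    response = openai.Completion.create(
        engine="davinci", prompt=prompt, max_tokens=64, temperature=0.7
    )
    return response.choices[0].text.strip()

def predict_outcome(state, action):
    # The LM as world model: roll the world forward one step, in text.
    return complete(f"Situation: {state}\nAction taken: {action}\nWhat happens next:")

def choose_action(state, candidate_actions):
    # Crude planner: imagine each action's outcome, then ask the LM to score
    # the outcome against the goal. All the "intelligence" lives in the LM.
    def score(action):
        outcome = predict_outcome(state, action)
        answer = complete(
            f"Goal: {GOAL}\nOutcome: {outcome}\n"
            f"On a scale of 0 to 10, how well does this outcome serve the goal? Answer:"
        )
        try:
            return float(answer.split()[0])
        except (ValueError, IndexError):
            return 0.0
    return max(candidate_actions, key=score)
```

Whether anything like this chains into the quoted post’s proto-AGI depends entirely on the assumption that bigger models will model the world much better; with GPT-3-level world modelling, a loop like the one above mostly produces plausible-sounding fiction.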