I’m speaking very much for myself and not for MIRI here. But, here goes (this is pretty similar to the view described here):
If we build AI systems out of business-as-usual ML, we're probably going to end up with systems trained with some kind of meta-learning (as described in Risks from Learned Optimization); they're going to be completely uninterpretable, and we're not going to be able to fix the inner alignment problem. And by default our ML systems won't be able to handle the strain of radical self-improvement: they'll accidentally allow their goals to shift as they self-improve, in the same way that if you tried to make a physicist by giving a ten-year-old access to a whole bunch of crazy mind-altering/enhancing drugs and the ability to do brain surgery on themselves, you might get unstable results. We can't fix this with things like ML transparency, adversarial training, or ML robustness. The only hope of building aligned really-powerful-AI-systems is having a much clearer picture of what we're doing when we try to build these systems.
I’m hearing “the current approach will fail by default, so we need a different approach. In particular, the new approach should be clearer about the reasoning of the AI system than current approaches.”
Notably, that's different from a positive case that sounds like "Here is such an approach and why it could work."
I’m curious how much of your thinking is currently split between the two rough possibilities below.
First:
I don't know of another approach that could work. So while I personally feel better able to understand some people's ideas than others', the many very different concrete suggestions for understanding these systems better are all arguably similar in terms of how likely we should think they are to pan out, and how many resources we should want to put behind them.
Alternatively, second:
While it's incredibly difficult to communicate mathematical intuitions of this depth, my sense is that I can see a very attractive case for why one or two particular efforts (e.g. MIRI's embedded agency work) could work out.
Thanks :)