Part of your question here seems to be, “If we can design a system that understands goals written in natural language, won’t it be very unlikely to deviate from what we really wanted when we wrote the goal?” Regarding that point, I’m not an expert, but I’ll point to some discussion by experts.
There are, as you may have seen, lists of examples where real AI systems have done things completely different from what their designers were intending. For example, this talk, in the section on Goodhart’s law, has a link to such a list. But from what I can tell, those examples never involve the designers specifying goals in natural language. (I’m guessing that specifying goals that way hasn’t seemed even faintly possible until recently, so nobody’s really tried it?)
Here’s a recent paper by academic philosophers that seems to support the idea behind your question. The authors argue that AGI systems that involve large language models would be safer than alternative systems precisely because they could receive goals written in natural language. (See especially the two sections titled “reward misspecification”, though note also the last paragraph, where they suggest it might be a better idea to avoid goal-directed AI altogether.) If you want more details on whether that suggestion is correct, you might keep an eye on reactions to this paper. There are some comments on the LessWrong post, and I see the paper was submitted for a contest.