I’ve tried to engage with the intro materials, but I still have several questions:
a. Why doesn’t my proposed prompt solve outer alignment?
b. Why would an AI ever pursue a proxy goal at the expense of its assigned goal? The human evolution analogy doesn’t quite make sense to me because evolution isn’t an algorithm with an assigned goal. Besides, even when friendship doesn’t increase the odds of reproduction, it doesn’t decrease the odds either; so this doesn’t seem like an example where the proxy goal is being pursued at the expense of the assigned goal.
c. I’ve read that it’s very difficult to get an AI to point at any specific thing in the environment. But wouldn’t that problem be resolved if the AI deeply understood the distribution of ways that humans use language to refer to things?
Thanks for thoughtfully engaging with this topic! I’ve spent a lot of time exploring arguments that alignment is hard, and am also unconvinced. I’m particularly skeptical about deceptive alignment, which is closely related to your point b. I’m clearly not the right person to explain why people think the problem is hard, but I think it’s good to share alternative perspectives.
If you’re interested in more skeptical arguments, there’s a forum tag and a LessWrong tag. I particularly like Quintin Pope’s posts on the topic.