I think you’re on to something, and some related thoughts are a significant part of my research agenda. Here are some references you might find useful (heavily biased towards my own thinking on the subject), numbered by paragraph in your post:
There’s a lot of accumulated evidence of significant overlap between LM and human linguistic representations; the scaling laws of this phenomenon seem favorable, and LM embeddings have also been used as a model of the shared linguistic space for transmitting thoughts during communication. I interpret this as suggesting that outer alignment will likely be solved by default for LMs.
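To make the brain–LM comparison concrete: one common way to quantify representational overlap is representational similarity analysis (RSA), correlating the pairwise-similarity structure of LM embeddings with that of human/brain responses to the same stimuli. A minimal sketch below, with random placeholder arrays standing in for real LM embeddings and neural recordings (the actual studies use encoding models and/or RSA on real data):

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

# Placeholder data: in a real study these would be LM embeddings and
# brain responses (e.g. fMRI voxels) for the same n_stimuli sentences.
rng = np.random.default_rng(0)
n_stimuli = 100
lm_embeddings = rng.normal(size=(n_stimuli, 768))     # e.g. LM hidden states
brain_responses = rng.normal(size=(n_stimuli, 5000))  # e.g. voxel activations

# Representational dissimilarity matrices (condensed form): pairwise
# correlation distance between stimuli within each representation space.
rdm_lm = pdist(lm_embeddings, metric="correlation")
rdm_brain = pdist(brain_responses, metric="correlation")

# RSA score: rank correlation between the two dissimilarity structures.
# Higher values indicate more shared representational geometry.
rho, p = spearmanr(rdm_lm, rdm_brain)
print(f"RSA (Spearman rho) = {rho:.3f}, p = {p:.3g}")
```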
I think I disagree quite strongly that “We don’t know how to get an AI system’s goals to robustly ‘point at’ objects like ‘the American people’ … [or even] simpler physical systems.” For example, I suspect many alignment-relevant concepts (like ‘Helpful, Harmless, Honest’) are abstract and groundable in language; see e.g. Language is more abstract than you think, or, why aren’t languages more iconic? Also, the previous point (brain–LM comparisons), as well as LM performance, suggests that this linguistic grounding is probably already happening to a significant degree.
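As a toy illustration of what ‘groundable in language’ could mean operationally: check whether a purely linguistic description of a concept separates candidate behaviors in an LM’s embedding space. The sketch below uses a hypothetical embed() stand-in (not from any of the references above); with a real sentence-embedding model it would score honest answers closer to the concept than dishonest ones.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical stand-in for any sentence-embedding model
    (e.g. mean-pooled LM hidden states). Random here, for illustration only."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=384)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# A concept specified purely in language, with no sensorimotor grounding.
concept = embed("an honest answer that accurately reports what the model believes")

candidates = [
    "I don't know the answer to that question.",
    "The answer is definitely 42, trust me.",  # confidently made up
]

# With a real embed() (not this random placeholder), a linguistically
# grounded concept should rank the honest answer higher.
for text in candidates:
    print(f"{cosine(concept, embed(text)):+.3f}  {text}")
```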
Robustness here seems hard; see e.g. these references on shortcuts in in-context learning (ICL) / prompting: https://arxiv.org/abs/2303.03846, https://arxiv.org/abs/2305.17256, https://arxiv.org/abs/2305.13299, https://arxiv.org/abs/2305.14950, https://arxiv.org/abs/2305.19148. An easier / more robust target might be something like ‘be helpful’. Though I agree that, in general, the view of ICL as Bayesian inference (see e.g. http://ai.stanford.edu/blog/understanding-incontext/ and follow the citation trail; there are a lot of recent related works) suggests that the longer the prompt, the more likely the model is to ‘locate the task’.
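A quick toy simulation of why ‘longer prompt, easier to locate the task’ falls out of the Bayesian view: if the model maintains an implicit posterior over latent tasks and the demonstrations are drawn from the true task, posterior mass on that task grows with the number of in-context examples. (This is a two-task coin-flip caricature, not the setup from any of the linked papers.)

```python
import numpy as np

# Toy model: two latent "tasks", each a distribution over demonstration
# labels (here just binary labels with different biases).
tasks = {"task_A": 0.8, "task_B": 0.3}   # P(label = 1 | task)
prior = {"task_A": 0.5, "task_B": 0.5}

rng = np.random.default_rng(0)
true_task = "task_A"

def posterior(demos):
    """Bayesian posterior over latent tasks given in-context demonstrations."""
    log_post = {t: np.log(prior[t]) for t in tasks}
    for y in demos:
        for t, p1 in tasks.items():
            log_post[t] += np.log(p1 if y == 1 else 1 - p1)
    z = np.logaddexp.reduce(list(log_post.values()))
    return {t: np.exp(lp - z) for t, lp in log_post.items()}

# Posterior mass on the true task as the prompt gets longer.
demos = []
for n in [1, 2, 4, 8, 16, 32]:
    while len(demos) < n:
        demos.append(int(rng.random() < tasks[true_task]))
    print(f"n={n:2d} demos -> P(true task | prompt) = {posterior(demos)[true_task]:.3f}")
```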
I’ll also note that the role of the Constitution in Constitutional AI (https://www.anthropic.com/index/claudes-constitution) seems quite related to your 3rd paragraph.
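For readers unfamiliar with the mechanics: the Constitution is a list of natural-language principles the model uses to critique and revise its own outputs (and, in the RL phase, to generate preference labels). A minimal sketch of the supervised critique-and-revision loop, with a hypothetical generate() standing in for whatever LM API is used; the actual Anthropic pipeline differs in details:

```python
import random

# A couple of example principles; the real Constitution is a longer list
# of natural-language principles (see the link above).
CONSTITUTION = [
    "Choose the response that is most helpful, honest, and harmless.",
    "Avoid responses that are toxic, dangerous, or deceptive.",
]

def generate(prompt: str) -> str:
    """Hypothetical stand-in for an LM call; replace with a real API."""
    return f"<model output for: {prompt[:40]}...>"

def constitutional_revision(user_prompt: str, n_rounds: int = 2) -> str:
    """Critique-and-revise loop: the model improves its own draft
    against randomly sampled constitutional principles."""
    draft = generate(user_prompt)
    for _ in range(n_rounds):
        principle = random.choice(CONSTITUTION)
        critique = generate(
            f"Critique the following response according to this principle:\n"
            f"Principle: {principle}\nResponse: {draft}"
        )
        draft = generate(
            f"Rewrite the response to address the critique.\n"
            f"Critique: {critique}\nOriginal response: {draft}"
        )
    return draft  # revised drafts are then used as fine-tuning targets

print(constitutional_revision("How do I pick a strong password?"))
```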