Thanks for reading the post, Catherine! I like this list a lot, and I agree that assessing the claim that ‘sub-AGI evidence of alignment doesn’t tell us about AGI alignment’ is the key here.
I think that trying to evaluate research agendas might still be important given this. We may struggle to verify the most general version of the claim above, but maybe we can make progress if we restrict ourselves to analysing the kinds of evidence generated by specific research agendas. Hence, if we try to answer the claim in the context of specific research agendas (like “to what extent does interpretability give us evidence of alignment in AGI systems?”), the question might become more tractable, although this is offset by having to answer more questions!