Thanks for posting this! I agree that AI alignment is currently pre-paradigmatic, but I disagree (at least partially) with your conclusions.
You mention two kinds of steering in your post: one focused on evaluating specific research agendas[1] (their assumptions, theories of change, and likelihood of meaningfully contributing to aligning AI), and another that investigates the extent to which we can get experimental evidence about alignment strategies from sub-AGI systems. I think the latter question is a crux for assessing how good different research agendas are, and that we should spend much more time working on it. I’m unconvinced that we should spend much more time on the former.
The reason I think the question of whether sub-AGI alignment evidence generalises to AGI is a crux is that it strongly informs which research agendas we should pursue. If it turns out that evidence does generalise, we can treat alignment like a science in the normal way: making theories, testing their predictions, and seeing which theories best fit the evidence. So, if evidence generalisation holds, we don’t need to spend any more time evaluating theoretical assumptions, theories of change, and so on; we just need to keep gathering evidence and discarding poorly-fitting theories. However, if a) evidence does not generalise well and b) we want to be very, very confident that we can align AGI, then we should spend far less time on research agendas which have weak theoretical justifications (even if they have strong evidential justifications), and more time on research agendas which seem to have strong theoretical justifications.
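To make the crux concrete, here is a minimal toy Bayesian sketch (all probabilities are made up purely for illustration): if sub-AGI evidence generalises, an experiment carries a strong likelihood ratio between competing alignment theories and quickly separates them; if it doesn’t, the same observations barely move our credences, and we are thrown back on theoretical arguments.

```python
# Toy picture of "gather evidence, discard poorly-fitting theories".
# All numbers are hypothetical; this only illustrates the structure of the argument.

def posterior(prior_a: float, likelihood_ratio: float) -> float:
    """Posterior credence in theory A after one observation favouring A,
    where likelihood_ratio = P(obs | A) / P(obs | B)."""
    prior_b = 1 - prior_a
    return (prior_a * likelihood_ratio) / (prior_a * likelihood_ratio + prior_b)

prior = 0.5  # start indifferent between alignment theories A and B

# If sub-AGI evidence generalises, one experiment strongly favours a theory.
print(posterior(prior, likelihood_ratio=10.0))  # ~0.91

# If it doesn't generalise, the same experiment is nearly uninformative about AGI.
print(posterior(prior, likelihood_ratio=1.1))   # ~0.52
```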
So if we’re in the ‘sub-AGI evidence of alignment doesn’t tell us about AGI alignment’ world, I pretty much agree with you, but with caveats:
I agree with Oscar that steering (in the ‘evaluating assumptions and theories of change of different research agendas’ sense) seems really difficult unless you have a lot of context and experience, and even then it sounds like it would take a long time to do well. I’m pessimistic that anyone other than established alignment researchers could do a good job of it, and unfortunately the opportunity cost of having those researchers work on steering is really high.
Oscar argues that the comparative advantage of senior researchers is steering, and that they should therefore spend more time steering. This conclusion doesn’t follow if you think steering and rowing have sufficiently different value: comparative advantage tells you who should steer if steering is going to happen anyway, not whether it is worth doing at the expense of rowing. I think the value of senior researchers doing direct research is sufficiently high that, even though they are comparatively better suited to steering, they should still spend their time on direct research.
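Here is a minimal toy calculation of that point (the numbers are entirely hypothetical and chosen only to illustrate the structure of the claim): even when the senior researcher has the comparative advantage at steering, total value can be higher if they row, provided rowing is sufficiently more valuable per unit time.

```python
# Toy model of the comparative-advantage argument. All numbers are made up.

# Value produced per unit time by each person at each activity.
value = {
    "senior": {"rowing": 10.0, "steering": 5.0},
    "junior": {"rowing": 2.0, "steering": 0.5},
}

# The senior is 5x the junior at rowing but 10x at steering,
# so the senior has the *comparative* advantage at steering.
print(value["senior"]["rowing"] / value["junior"]["rowing"])      # 5.0
print(value["senior"]["steering"] / value["junior"]["steering"])  # 10.0

# Allocation A: specialise by comparative advantage (senior steers, junior rows).
total_a = value["senior"]["steering"] + value["junior"]["rowing"]

# Allocation B: both row, because rowing is (by assumption) worth much more.
total_b = value["senior"]["rowing"] + value["junior"]["rowing"]

print(total_a)  # 7.0
print(total_b)  # 12.0 -> higher total value despite ignoring comparative advantage
```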
As we get closer to AGI (using your definition of ‘AI systems which are at least as capable as humans across a range of domains’), we should be more and more surprised if evidence of alignment doesn’t generalise. I would guess that AGI and not-quite-as-capable-as-humans-but-extremely-close AI aren’t qualitatively different, so it would be surprising if there were a discontinuous jump in how useful evidence is for assessing competing theories.
This means that as we get closer to AGI, we should be more confident that evidence of alignment on current AI systems is helpful, and so spend more time rowing (doing research which produces evidence) vs steering (thinking about the assumptions and theories of change of different research agendas).
It might be really hard to figure out whether sub-AGI evidence of alignment tells us about AGI alignment. In that case, given our uncertainty, it makes sense to spend some time steering as you describe it (i.e. evaluating the assumptions and theories of change of different research agendas). But this is time-consuming and has a high opportunity cost, and our answer to the evidence question is crucial for deciding how much time we should spend on it. Given this, I think the steering we do should be focused on the overarching question of whether sub-AGI evidence of alignment tells us about AGI alignment, rather than on the narrower task of evaluating different research agendas. Plausible research agendas should just be pursued.
[1] I think in your post you move between referring to epistemic strategies and research agendas. I understood it better when I took you to mean research agendas, so I’ll use that term in my comment.
Regarding the ‘plausible research agendas’ that should be pursued, I generally agree, while noting that even deciding what counts as plausible isn’t necessarily uncontroversial. Currently, I suppose it is grantmakers who decide on plausibility, which seems alright.
Also, given the large amounts of money available for conducting plausible alignment research, steering (or thinking about the relative value of different research agendas) seems less valuable: it is less decision-relevant when almost everything will be funded anyway. That said, if community-building is very successful and we 10x the number of alignment researchers, I imagine prioritisation within alignment would become a lot more important.
Thanks for reading the post, Oscar! I’ll reply to both of your comments here. I haven’t thought a lot about when one should start “steering” in their career, but I think starting with an approach focussed on rowing makes a lot of sense.
Addressing the idea that steering is less important if we can just fund all possible research agendas, I don’t think this necessarily holds. We seem to be talent-constrained to at least some extent, so every researcher focussed on a hopeless or implausible research agenda is one who isn’t working on a plausible one. Thus, even with lots of funding, steering is still important.
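A minimal sketch of the talent-constraint point, with made-up numbers: if funding is abundant, the binding constraint is how researchers are allocated, and steering that redirects researchers from implausible agendas to plausible ones still adds value that funding alone cannot.

```python
# Toy model: abundant funding, fixed pool of researchers.
# The numbers are hypothetical and purely illustrative.

researchers = 100
value_on_plausible_agenda = 1.0    # value per researcher on a plausible agenda
value_on_implausible_agenda = 0.0  # assumed roughly hopeless

def total_value(fraction_on_plausible: float) -> float:
    """Total research value given the fraction of researchers on plausible agendas."""
    n_plausible = researchers * fraction_on_plausible
    n_implausible = researchers * (1 - fraction_on_plausible)
    return (n_plausible * value_on_plausible_agenda
            + n_implausible * value_on_implausible_agenda)

# Without steering, suppose 70% of researchers happen to pick plausible agendas;
# with steering, suppose that rises to 90%. Funding covers everyone either way.
print(total_value(0.7))  # 70.0
print(total_value(0.9))  # 90.0 -> steering still adds value despite abundant funding
```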
Yes, good point. I now think I was wrong about how much the amount of available funding matters for the value of steering.
Thanks for reading the post, Catherine! I like this list a lot, and I agree that working out whether sub-AGI evidence of alignment tells us about AGI alignment is the key question here.
I think that trying to evaluate research agendas might still be important given this. We may struggle to settle the most general version of the question above, but maybe we can make progress if we restrict ourselves to analysing the kinds of evidence generated by specific research agendas. Hence, if we try to answer the question in the context of specific research agendas (like “to what extent does interpretability give us evidence of alignment in AGI systems?”), it might become more tractable, although this is offset by having to answer more questions!