This seems very useful to me. I’ve read books by Russell, Christian, and Bostrom, plus a load of other miscellaneous EA content (EA Forum, EAG, 80k, etc.) about AI Alignment, but wouldn’t have been able to distinguish these separate strands. So for me at least, this seems like very helpful de-confusion.
A couple of questions, if you’ve got time:
1.
In your ~30 conversations with and feedback from others, did you get much of a sense that others disagreed with your general categorisations here? That is, I’m sure that there are various ways that one could conceptually carve up the space, but did you get much feedback suggesting that yours might be wrong in some substantial way?
I’m trying to get a sense of whether this post represents a reasonable-but-controversial interpretation of the landscape, or one that would be widely accepted.
2.
You helpfully list some existing resources for each approach. Do you have a sense of roughly how resources (e.g. number of researchers / research hours; philanthropic $s) are currently divided between these different approaches?
3.
(I’d also be interested in how you or others would see the ideal distribution of resources, but I infer from your post that there might be a lot of disagreement about that.)
Thanks for the feedback! Really glad to hear it was helpful de-confusion for people who’ve already engaged somewhat with AI Alignment but aren’t actively researching in the field; that’s part of what I was aiming for.
1.
I didn’t get much feedback on my categorisation; I was mostly trying to absorb other people’s inside views on their specific strand of alignment, and most of the feedback on the doc was object-level discussion of each section. I didn’t get feedback suggesting the categorisation was wrong in some substantial way, but I’d also expect it to be considered ‘reasonable but controversial’ rather than widely accepted.
If it helps, I’m most uncertain about the following parts of this conceptualisation:
Separating power-seeking AI and inner misalignment, rather than merging them, since inner misalignment seems like the most likely way power-seeking AI arises
Having assistance games as an agenda, rather than as a way to address the ‘power-seeking AI’ or ‘you get what you measure’ threat models
Not having recursive reward modelling as a fully fledged agenda (this may just be because I haven’t read enough about it to really have my head around it properly)
Putting reinforcement learning from human feedback under ‘you get what you measure’—this seems like a pretty big fraction of current alignment effort, and might be better put under a category like ‘narrowly aligning superhuman models’
2.
It’s hard to be precise, but there’s definitely not an even distribution. And it depends a lot on which resources you care about.
A lot of the safety work at industry labs revolves around trying to align large language models, mostly with tools like reinforcement learning from human feedback. I mostly categorise this under ‘you get what you measure’, though I’m open to pushback there. This is very resource intensive, especially if you include the costs of training those large language models in the first place, and consumes a lot of capital, engineer time, and researcher time—though much of the money comes from companies like Google rather than from philanthropic sources.
The other large collections of researchers are at MIRI, who mostly do deconfusion work, and CHAI, who do a lot of things, including a bunch of good field-building, but probably the modal type of work is on training AIs with assistance games? This is more speculative though.
Most of the remaining areas are fairly small, though these are definitely not clear-cut distinctions.
It’s unclear which of these resources are most important to track: training large models is very capital intensive, and doing anything with them is fairly labour intensive and needs good engineers. But as, e.g., OpenPhil’s recent RFPs show, there is a lot of philanthropic money available for researchers with a credible case for being able to do good alignment research, which suggests we’re more bottlenecked by researcher time? And there, we’re much more bottlenecked by senior researcher time than junior researcher time.
3.
Very hard to say, sorry! Personally, I’m most excited about inner alignment and interpretability, and would really like to see more resources go to those. Generally, I’d also want to see a more even distribution of resources, for exploration, diversification, and value-of-information reasons. I expect different people would give wildly varying opinions.