Thanks for the feedback! Really glad to hear it was helpful de-confusion for people who've already engaged somewhat with AI Alignment but aren't actively researching in the field; that's part of what I was aiming for.
1
I didn't get much feedback on my categorisation; I was mostly trying to absorb other people's inside views on their specific strand of alignment. And most of the feedback on the doc was more object-level discussion of each section. I didn't get feedback suggesting the categorisation was wrong in some substantial way, but I'd also expect it to be considered 'reasonable but controversial' rather than widely accepted.
If it helps, I’m most uncertain about the following parts of this conceptualisation:
Separating power-seeking AI and inner misalignment, rather than merging them—inner misalignment seems like the most likely way power-seeking AI arises
Having assistance games as an agenda, rather than as a way to address the power-seeking AI or 'you get what you measure' threat models
Not having recursive reward modelling as a fully-fledged agenda (this may just be because I haven't read enough about it to really have my head around it properly)
Putting reinforcement learning from human feedback under 'you get what you measure'—this seems like a pretty big fraction of current alignment effort, and might be better placed under a category like 'narrowly aligning superhuman models'
2
It’s hard to be precise, but there’s definitely not an even distribution. And it depends a lot on which resources you care about.
A lot of the safety work at industry labs revolves around trying to align large language models, mostly with tools like reinforcement learning from human feedback. I mostly categorise this under 'you get what you measure', though I'm open to pushback there. This is very resource intensive, especially if you include the costs of training those large language models in the first place, and consumes a lot of capital, engineer time, and researcher time, though much of the money comes from companies like Google rather than philanthropic sources.
The other large collections of researchers are at MIRI, who mostly do deconfusion work, and CHAI, who do a lot of things, including a bunch of good field-building, but whose modal type of work is probably training AIs with assistance games? This is more speculative, though.
Most of the remaining areas are fairly small, though these are definitely not clear-cut distinctions.
It's unclear which of these resources is most important to track—training large models is very capital intensive, and doing anything with them is fairly labour intensive and needs good engineers. But as e.g. OpenPhil's recent RFPs show, there are a lot of philanthropic dollars available for researchers who have a credible case for being able to do good alignment research, suggesting we're more bottlenecked by researcher time? And there we're much more bottlenecked by senior researcher time than junior researcher time.
3
Very hard to say, sorry! Personally, I'm most excited about inner alignment and interpretability, and really want to see those get more resources. Generally, I'd also want to see a more even distribution of resources for exploration, diversification, and value-of-information reasons. I expect different people would give wildly varying opinions.