> Thus we EAs should vigorously investigate whether this concern is well-founded
I am not an EA, but I am an alignment researcher. I see only a small sliver of alignment research for which this concern would be well-founded.
To give an example that is less complicated than my own research: suppose I design a reward function component, say some penalty term, that can be added to an AI/AGI reward function to make the AI/AGI more aligned. Why would I not publish this? I would want to publish it widely, so that more people actually design/train their AI/AGI using this penalty term.
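To make this concrete, here is a minimal sketch of what I mean by adding a penalty term to a reward function. The specific functions (`base_reward`, `side_effect_penalty`) and the side-effect framing are just placeholders for illustration, not the actual research in question:

```python
def base_reward(state, action):
    # Task reward defined by whoever builds the AI/AGI (placeholder).
    return float(state.get("task_progress", 0.0))

def side_effect_penalty(state, action):
    # Hypothetical published penalty term, e.g. penalizing side effects.
    return float(state.get("side_effect_magnitude", 0.0))

def shaped_reward(state, action, penalty_weight=1.0):
    # The builder keeps their own base reward and simply subtracts the
    # published penalty term, weighted by a tunable coefficient.
    return base_reward(state, action) - penalty_weight * side_effect_penalty(state, action)

# Example: a step that makes task progress but causes some side effects.
state = {"task_progress": 1.0, "side_effect_magnitude": 0.3}
print(shaped_reward(state, action=None))  # 0.7 with penalty_weight=1.0
```

The point is that the penalty term is only useful if the people actually training AI/AGI systems know about it and add it to their reward functions, which is exactly why I would want it published widely.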
Your argument rests on a built-in assumption that it will be hard to build AGIs which lack the instrumental drive to protect themselves from being switched off, or from having their goals changed. I do not share this assumption, but even if I did, I would see no downside to publishing alignment research on writing better reward functions.