There seem to be two obvious models:
1) The intractability model: AGI = doom, and the only safe move is not to build it
2) The race / differential-progress model: safety needs to stay ahead of capabilities by some margin before capabilities reach point X
As far as I can tell, alignment is advancing much more slowly per researcher than capabilities. So even if you spend 1 year on capabilities and 10 on alignment, your net effect under the differential-progress model is still bad, and under the intractability model it's even worse.
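To make the arithmetic explicit, here is a minimal sketch under assumed numbers: take c and a to be purely illustrative per-researcher-year rates of capabilities and alignment progress on a common scale (neither is a measured quantity), and look at the net change to the safety margin from a 1:10 split of your time.

```latex
% Illustrative only: c and a are assumed per-researcher-year rates of
% capabilities and alignment progress on a common scale, not measured values.
\[
  \Delta_{\mathrm{margin}} \;=\; 10\,a \;-\; 1\cdot c
\]
% If c > 10a (capabilities advancing more than 10x faster per researcher-year),
% then \Delta_{\mathrm{margin}} < 0: even a 1:10 split shrinks the lead that
% safety needs to have before capabilities reach point X.
```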
I’m curious how much the “having aligned people in the room is good” theory can be assessed already. I’m personally not a big buyer of it. For example, this effect doesn’t seem visible in the Manhattan Project or in the nuclear policy that followed.