Can you clarify this a bit: “Only if the safety/alignment work applies directly to the future maximiser AIs (for example, by allowing us to understand them) does it seem very advantageous to me.”
Suppose we have some LLM interpretability technology that takes LLMs from slightly worse than humans at planning to slightly better (say, because it reduces the risk of hallucinations), and these LLMs will ultimately be used by both humans and future agentic AIs. The improvement from human-level planning to better-than-human planning benefits both humans and optimiser AIs. But the improvement up to human level is a much bigger boost for the agentic AI, which would otherwise not have access to such planning capabilities, than for humans, who already had human-level abilities. So this interpretability technology actually ends up making crunch time worse.
It’s different if this interpretability (or other form of safety/alignment work) also applies to future agentic AIs, because we could use it to directly reduce the risk from them.
So your argument here is that, if we are going to go this route, interpretability technology should be used in the future as a measure to ensure the safety of these agentic AIs, just as much as it is currently being used to improve their “planning capabilities”?
It seems I’ve got the knack of it now…