Copy-pasting here from LW.
Sorry, but my rough impression from the post is that you seem to be at least as confused about where the difficulties are as the average alignment researcher you think is not on the ball, and the style of somewhat strawmanning everyone & the strong words is a bit irritating.
Maybe I'm getting it wrong, but it seems the model you have for why everyone is not on the ball is something like "people are approaching it too much from a theory perspective, and the promising approach is very close to how empirical ML capabilities research works" & "this is a type of problem where you can just throw money at it and attract better ML talent".
I don’t think these two insights are promising.
Also, again, maybe I'm getting it wrong, but I'm confused about how similar you imagine current systems to be to the dangerous systems. Either the superhuman-level problems (e.g. not lying in a way no human can recognize) are somewhat continuous with current problems (e.g. not lying), in which case it is possible to study them empirically, or they are not. But different parts of the post seem to point in different directions. (Personally, I think the problem is somewhat continuous, but many of the human-in-the-loop solutions are not, and just break down.)
Also, given what you find promising, I'm confused about what you think the 'real science' to aim for is: on one hand, it seems you think the closer the work is to how ML is done in practice, the more it is real science; on the other hand, in your view all deep learning progress has been empirical, often via dumb hacks and intuitions (which isn't true, imo).