Hmm, I see what you mean, but I think I disagree.[1] It’s not obvious that “it’s not obvious that” is bad. And I think it makes sense sometimes to write a post that appeals to authority—but you should be very aware of when you’re doing that.
What I somewhat dislike about this tweet is that it mixes up trying to make you defer to it while also providing gears-level arguments. It invites the mistake of getting woozled by the technical arguments just because they came from “experts”.[2]
Additionally, on the object level, I kinda think alignment researchers should be trying to theoretically innovate on the capabilities frontier, just keep it secret and don’t tell anyone you don’t trust very highly. It’s hard to try to align something you don’t have a deep understanding of.
This paragraph is mostly about communicating the patterns, but I still lead with “what I somewhat dislike”—referring to my beliefs rather than communicating content.
We should ideally have a standard back-to-object-level codeword that can help defuse discussions from increasingly becoming more and more about belief-states rather than about communicating patterns. Something like… if you say “blueberry muffins”, your interlocutor will know to stop asking or revealing about (dis)agreement or estimates.
Hmm, I see what you mean, but I think I disagree.[1] It’s not obvious that “it’s not obvious that” is bad. And I think it makes sense sometimes to write a post that appeals to authority—but you should be very aware of when you’re doing that.
What I somewhat dislike about this tweet is that it mixes up trying to make you defer to it while also providing gears-level arguments. It invites the mistake of getting woozled by the technical arguments just because they came from “experts”.[2]
Additionally, on the object level, I kinda think alignment researchers should be trying to theoretically innovate on the capabilities frontier, just keep it secret and don’t tell anyone you don’t trust very highly. It’s hard to try to align something you don’t have a deep understanding of.
Notice how this whole comment is largely about my state of belief/agreement, rather than about the patterns themselves.
This paragraph is mostly about communicating the patterns, but I still lead with “what I somewhat dislike”—referring to my beliefs rather than communicating content.
We should ideally have a standard back-to-object-level codeword that can help defuse discussions from increasingly becoming more and more about belief-states rather than about communicating patterns. Something like… if you say “blueberry muffins”, your interlocutor will know to stop asking or revealing about (dis)agreement or estimates.