This sounds confused: people who do interpretability already know the model weights; what they are trying to do is interpret them. They are not learning the weights from the system's external behaviour or anything like that.