Isn't mechinterp basically setting out to build tools for AI self-improvement?
One of the things people are most worried about is AIs recursively improving themselves. (Whether everyone who claims this as a red line would actually treat it as one is a separate question for another post.)
It seems to me like mechanistic interpretability is basically a really promising avenue for that. Trivial example: Claude decides that the most important thing is being the Golden Gate Bridge. Claude reads up on Anthropic's work, gets access to the relevant tools, and does brain surgery on itself to turn into Golden Gate Bridge Claude.
More meaningfully, it seems like any ability to understand in a fine-grained way what's going on in a big model could be co-opted by an AI to "learn" in some way. In general, I think the case that seems most likely soonest is:
Learn in-context (e.g. results of experiments, feedback from users, things like what we've recently observed in scheming papers...)
Translate this to appropriate adjustments to weights (identified using mechinterp research)
Execute those adjustments (a rough sketch of what the last two steps might look like follows this list)
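To make that a bit more concrete, here is a minimal, purely illustrative sketch of "translate, then execute" as activation steering, assuming PyTorch and a toy stand-in block. Nothing here is Anthropic's actual tooling or any real model's API; the block, concept direction, and steering strength are hypothetical placeholders.

```python
# Illustrative only: a toy "activation steering" sketch using a PyTorch
# forward hook on a stand-in transformer block. All names here are
# hypothetical placeholders, not any real model or library API.

import torch
import torch.nn as nn

class TinyBlock(nn.Module):
    """Stand-in for a transformer block whose output is the residual stream."""
    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.proj(x)

d_model = 16
block = TinyBlock(d_model)

# "Translate to adjustments": suppose interpretability work handed us a unit
# direction believed to encode some concept (here it is just random noise).
concept_direction = torch.randn(d_model)
concept_direction /= concept_direction.norm()
steering_strength = 4.0

# "Execute those adjustments": clamp the direction into the block's output.
def steering_hook(module, inputs, output):
    return output + steering_strength * concept_direction

handle = block.register_forward_hook(steering_hook)
steered = block(torch.zeros(1, d_model))
handle.remove()
baseline = block(torch.zeros(1, d_model))

# The difference between the two runs lies entirely along the injected direction.
print(((steered - baseline) @ concept_direction).item())  # ~= 4.0
```

The real version would presumably find the direction empirically (say, from a sparse-autoencoder feature) rather than conjuring it, and might bake it into the weights rather than hooking activations, but the shape of "translate, then execute" is the same.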
Maybe I'm late to this party and everyone was already conceptualising mechinterp as a very dual-use technology, but I'm here now.
Honestly, maybe it leans more towards "offense" (i.e., catastrophic misalignment) than defense! It will almost inevitably require automation to be useful, so we're ceding it to machines out of the gate. I'd expect tomorrow's models to be better placed than humans to make sense of, and make use of, mechinterp techniques: partly just because of sheer compute, but also maybe (and now I'm speculating about stuff I understand even less) because the nature of their cognition is more suited to what's involved.