I’m running Redwood Research’s interpretability research.
I’ve considered running an “interpretability mine”—we get 50 interns, put them through a three-week training course on transformers and our interpretability tools, and then put them to work on building mechanistic explanations of parts of some model like GPT-2 for the rest of their internship.
My usual joke is “GPT-2 has 12 attention heads per layer and 48 layers. If we had 50 interns and gave them each a different attention head every day, we’d have an intern-day of analysis of each attention head in about 12 days.”
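For concreteness, here’s the back-of-envelope arithmetic behind the joke, a quick sketch using the head and layer counts as stated above:

```python
import math

# Back-of-envelope check of the joke's arithmetic, using the numbers as stated above.
heads_per_layer = 12
layers = 48
interns = 50

total_heads = heads_per_layer * layers          # 576 attention heads
days_needed = math.ceil(total_heads / interns)  # one head per intern per day -> 12 days

print(f"{total_heads} heads / {interns} interns -> {days_needed} days of analysis")
```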
This plan is bottlenecked on a few things:
having a good operationalization of what it means to interpret an attention head, and having some way to assess the quality of the explanations the interns produce. This could also be phrased as “having more of a paradigm for interpretability work”.
having organizational structures that would make this work
building various interpretability tools so that this work is relatively easy for a smart CS/math undergrad who has completed our three-week course
I think there’s a 30% chance that in July, we’ll wish we had 50 interns to do something like this. Unfortunately, that’s too low a probability for it to make sense for us to organize the internship.
Now that it’s after July, did you ever end up wishing you had 50 interns to do something like this?
I am glad we did not have 50 interns in July. But I’m 75% that we’ll run a giant event like this with at least 25 participants by the end of January. I’ll publish something about this in maybe a month.
Cool!