Neel Nanda answers What are the coolest topics in AI safety, to a hopelessly pure mathematician?

Neel Nanda 7 May 2022 20:50 UTC
15 points
I did a pure maths undergrad and recently switched to doing mechanistic interpretability work. My day job isn’t exactly doing maths, but I find it has a strong aesthetic appeal in a similar way. My job is not to train an ML model (with all the mess and frustration that involves); it’s to take a model someone else has trained and try to rigorously understand what is going on inside it. I want to take some behaviour I know it’s capable of and understand how it does that, and ideally decompile the operations it’s running into something human-understandable. And, fundamentally, a neural network is just a stack of matrix multiplications. So I’m trying to build tools and lenses for analysing this stack of matrices and converting it into something understandable. Day-to-day, this looks like having ideas for experiments, writing code and running them, getting feedback and iterating, but there have been a handful of times where having good intuitions around linear algebra or how gradients work, and spending some time working through the algebra, has been really useful and clarifying.
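To make the "stack of matrices" picture a bit more concrete, here is a minimal NumPy sketch. The weights `W1` and `W2` and the helper names are made up for illustration, not taken from any real model. It shows that a purely linear stack collapses into a single matrix you can study directly, and that putting a ReLU between the layers (as real networks do) breaks that collapse, which is part of why the analysis takes real work.

```python
import numpy as np

# Hypothetical toy weights, chosen by hand so the arithmetic is easy to check.
W1 = np.array([[1.0, 0.0],
               [0.0, 1.0],
               [1.0, 1.0]])          # maps 2 dims -> 3 dims
W2 = np.array([[1.0, 1.0, -1.0]])    # maps 3 dims -> 1 dim


def linear_stack(x):
    """Two layers with no nonlinearity in between."""
    return W2 @ (W1 @ x)


def relu_stack(x):
    """The same two layers with a ReLU applied after the first."""
    return W2 @ np.maximum(W1 @ x, 0.0)


# Without nonlinearities the whole stack is one matrix: here the zero map.
W_collapsed = W2 @ W1                # [[0., 0.]]

x = np.array([1.0, -1.0])
linear_out = linear_stack(x)         # equals W_collapsed @ x
relu_out = relu_stack(x)             # differs: the ReLU zeroes the -1 entry

print("collapsed matrix:", W_collapsed)
print("linear stack:", linear_out)
print("ReLU stack:", relu_out)
```

In this toy case the linear stack is identically zero, yet the ReLU version still responds to the input, so the product of the weight matrices alone does not describe what the network computes.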
If you’re interested in learning more, Zoom In is a good overview of a particular agenda for mechanistic interpretability in vision models (which I personally find super inspiring!), and my team wrote a pretty mathsy paper giving a framework to break down and understand small, attention-only transformers (I expect the paper to only make sense after reading an overview of autoregressive transformers like this one). If you’re interested in working on this, there are currently teams at Anthropic, Redwood Research, DeepMind and Conjecture doing work along these lines!
- Jenny K E 10 May 2022 22:10 UTC
  6 points
  Thanks very much for the suggestions, I appreciate it a lot! Zoom In was a fun read—not very math-y but pretty cool anyway. The Transformers paper also seems kind of fun. I’m not really sure whether it’s math-y enough for me to be interested in it qua math...but in any event it was fun to read about, which is a good sign. I guess “degree of mathiness” is only one neuron of several neurons sending signals to the “coolness” layer, if I may misuse metaphors.