(Intro/1) - My Mechanistic Interpretability Notebook

I’m currently honing my skills in mechanistic interpretability (MI) through BAISC, and thought I’d start keeping a record of what I’m learning and my reflections along the way. I’m framing this as a personal blog post, since I don’t want to overwhelm the Forum with my less polished musings.

This post has three main threads: a) a bit of background on my existing skills and why I find myself drawn towards MI, b) useful discoveries I’ve come across, and c) quick reflections and plans for the next stage of my journey.

Background Bits and Bobs

A recent retreat with the Global Challenges Project led to an intriguing encounter. I met someone who, despite having no formal background in Computer Science or Mathematics, displayed a striking understanding of technical AI safety work. Talking with this person, who is heavily involved in safety and governance, led me to question a longstanding assumption of mine: that my law degree and non-technical foundation meant I should be wary of engaging with technical AI safety.

This seems to be a shared sentiment among EAs, and it risks hardening into an unhelpful heuristic: people with a non-technical background who want to work on safety may inadvertently restrict themselves to non-technical roles.

While it’s still early days, I suspect that the leap from starting at zero to getting stuck into interpretability or alignment work isn’t insurmountable. One approach is to dedicate a three-month stint to trying out different kinds of alignment work: participate in an interpretability hackathon, kickstart a reading club on MI papers, and tackle problems in your spare time. After this, you can reassess whether the technical path might suit you. This pretty much encapsulates my plan for the coming months.

I don’t want to be overly eager, though: this stuff is complex and challenging. I find work like MI fascinating, and trying to decipher transformers is exciting, but it’s a tough nut to crack. The same goes for attempting to read Toy Models of Superposition and A Mathematical Framework for Transformer Circuits. Getting through them required scouring other people’s notes, posing questions to BAISC folks, creating explanatory videos for myself, and perusing papers and LessWrong/Alignment Forum posts until I got it. Alongside this, the perennial companion of imposter syndrome isn’t easy to shrug off. I reckon I manage it decently, but I acknowledge it can hit others harder.
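As an aside, the core setup in Toy Models of Superposition turned out to be surprisingly small once it clicked for me. Here’s a minimal sketch of that kind of model, written from my own understanding rather than taken from the paper’s code; the feature count, sparsity level, and training details are arbitrary choices of mine:

```python
import torch
import torch.nn as nn

# Minimal sketch of a "toy model of superposition" (my reconstruction, not
# the paper's code): sparse features are squeezed through a smaller hidden
# space, and the model is trained to reconstruct them.

n_features, n_hidden = 5, 2  # more features than hidden dimensions

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.W = nn.Parameter(torch.randn(n_hidden, n_features) * 0.1)
        self.b = nn.Parameter(torch.zeros(n_features))

    def forward(self, x):
        h = x @ self.W.T                          # project down: (batch, n_hidden)
        return torch.relu(h @ self.W + self.b)    # reconstruct: (batch, n_features)

model = ToyModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(5000):
    # Sparse inputs: each feature is active (uniform in [0, 1]) with prob 0.05.
    x = torch.rand(256, n_features) * (torch.rand(256, n_features) < 0.05)
    loss = ((model(x) - x) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# With sparse features, the columns of W tend to share directions
# (superposition) rather than forming a clean orthogonal basis.
print(model.W.T @ model.W)  # feature-by-feature interference pattern
```

The key ingredient is the sparsity: because features rarely co-occur, the model can afford to pack more feature directions into the hidden space than it has dimensions, which is exactly the phenomenon the paper studies.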

Useful Discoveries

There are plenty of resources for diving into MI, many of which are collected on Neel Nanda’s website. However, much of the value I’ve found has come from probing the minds of those who know more than I do, which has been a more engaging endeavour than solitary paper reading. I’m also all for reaching out to others in the community for guidance. Being a newbie is tough, and given the sense that people in the alignment community highly value their time, I’ve been reluctant to ask for help. But, contrary to my fears, people have been warm and generous. While you inevitably end up doing most of the work yourself, it’s a good idea to seek help when you hit roadblocks.

On My To-Do List

I’m in the midst of replicating GPT-2 from scratch. Once I’m confident in that, I’ll turn my attention to some concrete problems; a few ideas are already brewing. By my next check-in, I hope to be able to describe the specific problem I’m grappling with. For now, I’m aiming to get comfortable with PyTorch, build strong intuitions, and then dive in headfirst.
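For a flavour of what “from scratch” means here, below is a rough sketch of a single GPT-2-style causal self-attention layer in PyTorch. It’s purely illustrative: the dimensions, names, and fused-QKV layout are my own assumptions rather than a faithful copy of any particular codebase:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

# Rough sketch of GPT-2-style causal self-attention (illustrative only;
# sizes and names are my own choices, not from any particular codebase).

class CausalSelfAttention(nn.Module):
    def __init__(self, d_model=768, n_heads=12, max_len=1024):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)  # fused Q, K, V projection
        self.proj = nn.Linear(d_model, d_model)     # output projection
        # Lower-triangular mask: each position may only attend to the past.
        mask = torch.tril(torch.ones(max_len, max_len)).view(1, 1, max_len, max_len)
        self.register_buffer("mask", mask)

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=-1)
        # Reshape each to (batch, heads, seq, d_head).
        q = q.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(self.d_head)
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)
        y = (att @ v).transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(y)

x = torch.randn(2, 16, 768)  # (batch, seq_len, d_model)
print(CausalSelfAttention()(x).shape)  # torch.Size([2, 16, 768])
```

Writing components like this by hand, rather than importing them, is what has been forcing me to actually internalise where the query, key, and value circuits live inside a transformer.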
