Neuronpedia - AI Safety Game

EA CONTEXT

Neuronpedia is funded by a short-term grant from an EA fund which ends in a few weeks. In lieu of the traditional “final report/paper” they requested, I am posting my project on the EA forum, so that you can play the game, provide feedback, and hopefully contribute.

I am new to this forum, so I apologize if this is not the right format/place for this.

WHAT’S NEURONPEDIA?

Neuronpedia is an AI safety game that documents and explains neurons (and features found with Sparse Autoencoders) in modern AI models. It aims to be the Wikipedia for interpreting AI models, where the contributions come from users playing a game. Neuronpedia also wants to connect the general public to AI safety, so it’s designed to not require any technical knowledge to play.

OBJECTIVES

Crowdsource AI interpretability to help build safer AI (“Geoguessr for AI”)
Increase public engagement, awareness, and education in AI safety

LONG TERM VISION

Anyone can upload their LLM, and then Neuronpedia does all the interpretability work: generate activation texts, make features/directions, generate explanations, collect human/player explanations, score explanations, and produce/export all the data via an API or to files. Basically, a one-stop shop for interpretability.

Neuronpedia is not that far off from this, feature-wise. We have done every part of those tasks, but haven’t built the tooling to do them all in one streamlined process.

STATUS

Neuronpedia’s v1.0 is ready for public release and will be posted in more places in the coming days.
During the beta period, we collected 3,000+ explanations from players, over half of which beat GPT-4's explanations when scored using OpenAI’s Automated Interpretability. Over 500 votes have been collected, and over 2,000 explanations have been verified.
Initially only explaining neurons, Neuronpedia now also has features/directions found with Sparse Autoencoders, thanks to researchers Hoagy Cunningham, Logan Riggs, and Aiden Ewart.

WHAT YOU CAN DO

Play @ neuronpedia.org—you can log in with email, Apple, Google, or GitHub.
Tell a friend—even those who are not interested in AI.
Give feedback, ideas, and ask questions.
Join Discord—https://discord.gg/kpEJWgvdAx

FEATURES

Pixel-based role-playing game complete with graphics/animations, leveling up, equipment, potions, etc.
Explain Mode—View activations for a neuron/feature and explain them.
Dig Mode—Verify the accuracy of explanations.
In-game Shop—Earning “Bones” (in-game currency) by playing Dig Mode lets you buy stuff like colorful hats for you and your character’s pet, etc.
Leaderboard—Compete against other players.
Mobile Optimized—Play on the bus or when you’re having trouble falling asleep from AI nightmares.
Sharing / Summary
View Data—Anyone can browse the current results, vote, and even comment on neurons/features: http://neuronpedia.org/gpt2-small
Test New Activations—You can test new activation texts for every neuron/feature.

MOST SIGNIFICANT FLAWS

Scoring: OpenAI’s Automated Interpretability is good, but it often doesn’t take the context of tokens into account. For example, if the highest-scoring token in each of the top 20 activations is the word “cat” in fragments like “blue cat”, “white cat”, and “black cat”, the highest-scoring explanation should be “cat with a color before it”, but the explanation “the word ‘cat’” will often score higher.
- Solution 1: Train a model specific to scoring that also looks at context. This is a work in progress; we have enough activations/explanations that we can do this.
- Solution 2: User scoring. This is implemented via “Dig Mode” in the game, but I haven’t had time to connect it to the “Explain Mode” to complete the loop of generating a score. Also, it’s asynchronous, so it doesn’t immediately give the user feedback.
- Solution 3: Regex or other non-AI based scoring. This was suggested by player duck-master, but I haven’t had time to figure out the details or implement it.
Polysemanticity: I’m grateful for the features/directions from Hoagy/Logan/Aiden, which are less polysemantic than neurons, but they’re limited in number (~3000 non-dead features, ~800 of which are somewhat low-activating) due to our lack of GPU resources and their lack of time. I believe that more compute can result in even better features.
Onboarding is lacking / rough: there should be a much more graceful transition in difficulty from onboarding to the real game, e.g., giving users some fake puzzles first so they can get the hang of it.
- Solution: Fairly straightforward to do this but haven’t had time.
Optimizing for Fun vs Usefulness: Building a game around what AI safety researchers need has been a tough but interesting challenge. In a pure game, your only objective is to get the player glued to the screen, and you can adjust whatever you like to make this happen. For example, you could generate thousands of really easy puzzles (e.g., Angry Birds, clicker games) that make the player feel good. Word games like Wordle have a definite solution, making “winning” more satisfying/fair. For Neuronpedia, the difficulty is that there is not a “right answer”, or you won’t get the best answer, or the scoring/results will seem unfair.
- Solution: There’s a lot to say here, but I’m optimistic and have seen that tweaking gameplay using a bit of creativity/cleverness can overcome these problems. Also, there are some very involved players who are happy to contribute to AI safety even if the game isn’t as engaging as Zelda or Mario.
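
To make the regex-based scoring idea (Solution 3 under Scoring) a bit more concrete, here is a minimal sketch of how it might work. This is not Neuronpedia's or OpenAI's implementation, and the details of the player's suggestion are not worked out here. It assumes an explanation can be written as a regular expression, that the regex "predicts" an activation of 1 for matching text and 0 otherwise, and that the score is the correlation between those predictions and the real activations. The function names, the data format, and the activation values are all made up for illustration.

```python
import math
import re


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant prediction carries no information
    return cov / math.sqrt(var_x * var_y)


def regex_score(pattern, examples):
    """Score a regex "explanation" against (text, activation) pairs.

    Hypothetical data format: each example is a text fragment plus the
    neuron's peak activation on that fragment.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    predicted = [1.0 if regex.search(text) else 0.0 for text, _ in examples]
    actual = [activation for _, activation in examples]
    return pearson(predicted, actual)


# Toy data, invented for this sketch: the neuron fires on "<color> cat"
# but only weakly on "cat" without a color in front of it.
examples = [
    ("blue cat", 9.0),
    ("white cat", 8.5),
    ("black cat", 8.8),
    ("the cat", 1.0),
    ("a cat sat", 0.5),
    ("blue sky", 0.2),
    ("dog", 0.0),
]

COLOR_CAT = r"\b(blue|white|black)\s+cat\b"  # "cat with a color before it"
ANY_CAT = r"\bcat\b"                         # "the word 'cat'"

color_score = regex_score(COLOR_CAT, examples)
any_score = regex_score(ANY_CAT, examples)

print(f"color + cat: {color_score:.3f}")
print(f"any cat:     {any_score:.3f}")
```

On this toy data, the context-aware pattern scores higher than the bare word "cat", but only because the examples include low-activation fragments such as "the cat". If the scorer only sees the top activations, both patterns match everything and cannot be told apart, which is the context problem described above.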

WHAT’S NEXT FOR NEURONPEDIA

Features (other than the ones mentioned in the “flaws” section above)
- Data / ML
  - Let anyone upload their own LLMs
  - Let anyone upload their own features/directions
  - More and Better Models, Layers, Directions/Features
- Game
  - New Game Types/Modes—PvP, teams, contests, etc
    - Turn other Interp needs into new game modes
  - Weekly Summary and other notifications
  - Character / Profile Refresh
  - Challenges / Contests
  - Achievements
  - Unlockables (robes, shoes, gloves, wands, etc)
- General
  - Code cleanup/backend optimizations/fixes
Funding—Neuronpedia’s three-month EA grant runs out soon. I’ve spent personal funds on keeping the site up, servers/inference machines, and other bills. I’d really like to keep working on it full time, so let me know (johnny@neuronpedia.org) if you’re interested in supporting it—or if you can recommend me for a grant. I’m also open to investment for equity, but I’m not sure if an AI safety game will beat the S&P 500 - though you could potentially profit from reducing humanity’s AI risk!