Project Proposal: Working on understanding how AIs work by watching them think as they play video games. Needs Python developers, possibly C++.
I’d like to extend my current technical alignment project stack in one of a few non-trivial ways and would love help from more experienced software engineers to do it.
- Post: https://www.lesswrong.com/posts/bBuBDJBYHt39Q5zZy/decision-transformer-interpretability
- GitHub: https://github.com/jbloomAus/DecisionTransformerInterpretability
I’m not sure what the spread of technical proficiency is or how interested people are in assisting with my research agenda, but I’ve made a list of what I think are solid engineering challenges that I would love to get help with. Items 1 and 2 are things I can do/manage myself; item 3 is something I would need assistance with from someone with more experience.
1. Re-implementing bespoke grid worlds such as the AI Safety GridWorlds, proper mazes, or novel environments in currently maintained/compatible packages (Gymnasium and/or MiniGrid) to study alignment-relevant phenomena in RL agents/agent simulators.
2. Implementing methods for optimizing inputs (feature visualization) for PyTorch models/MiniGrid environments.
3. Developing a real-time mechanistic interpretability app for procgen games (i.e. extending https://distill.pub/2020/understanding-rl-vision/#feature-visualization to game-time, interactive play with pausing). I have a Streamlit app that does this for grid worlds, which I can demo.
Further Details:
1. The AI Safety GridWorlds suite (https://github.com/deepmind/ai-safety-gridworlds) is more than five years old and is implemented in DeepMind’s pycolab engine (https://github.com/deepmind/pycolab). I’d love to study these environments with the current mechanistic interpretability techniques implemented in TransformerLens and the Decision Transformer Interpretability codebase; however, getting this all working will take time, so it would be cool if people were interested in smashing that out. Having proper mazes for agents to solve in MiniGrid would also be interesting, as a way to test our ability to reverse-engineer algorithms from models using current techniques (see the environment sketch after this list).
2. Feature visualization techniques aren’t new, but they have previously been applied to continuous input spaces such as the image inputs of CNNs. However, recent work by Jessica Rumbelow (SolidGoldMagikarp post: https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation#Prompt_generation) has shown that it’s possible to apply this technique to discrete spaces such as word embeddings. Extending this to the discrete environments we have been studying or might study (see 1) may provide valuable insights (see the optimization sketch after this list). Lucent (Lucid for PyTorch) may also be useful for this.
3. The current interactive analysis app for Decision Transformer Interpretability is written in Streamlit and so runs very slowly. This is fine for grid-world-type environments but won’t work for continuous, procedurally generated environments like procgen (https://github.com/openai/procgen). Writing a procgen/Python wrapper that provides live model analysis (with the ability to pause mid-game) will be crucial to further work (see the game-loop sketch after this list).
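To make item 1 concrete, here is a minimal, untested sketch of what a hand-built maze-like environment might look like with the current MiniGrid/Gymnasium API. The class name and layout are purely illustrative (a single internal wall rather than a real maze generator), and the exact constructor arguments may differ slightly across minigrid versions:

```python
from minigrid.core.grid import Grid
from minigrid.core.mission import MissionSpace
from minigrid.core.world_object import Goal, Wall
from minigrid.minigrid_env import MiniGridEnv


class ToyMazeEnv(MiniGridEnv):
    """Tiny hand-built 'maze': outer walls, one internal wall with a gap, and a goal."""

    def __init__(self, size=9, **kwargs):
        super().__init__(
            mission_space=MissionSpace(mission_func=lambda: "reach the goal"),
            grid_size=size,
            max_steps=4 * size * size,
            **kwargs,
        )

    def _gen_grid(self, width, height):
        self.grid = Grid(width, height)
        self.grid.wall_rect(0, 0, width, height)      # boundary walls
        for y in range(1, height - 2):                # internal wall, gap at the bottom
            self.grid.set(width // 2, y, Wall())
        self.put_obj(Goal(), width - 2, height - 2)   # goal in the far corner
        self.agent_pos = (1, 1)
        self.agent_dir = 0
        self.mission = "reach the goal"


# Standard Gymnasium interaction loop:
env = ToyMazeEnv(render_mode="rgb_array")
obs, info = env.reset()
obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
```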
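For item 2, here is a rough sketch of the kind of discrete feature visualization I have in mind: gradient ascent on a continuous embedding vector, then projecting the result back onto the nearest legal discrete inputs, in the spirit of the prompt-generation trick from the SolidGoldMagikarp post. `activation_of_interest` and `embedding_table` are hypothetical stand-ins, not anything from an existing codebase:

```python
import torch

def optimise_embedding(activation_of_interest, d_model, steps=200, lr=0.1):
    """Gradient ascent on a continuous embedding vector.

    `activation_of_interest` is a hypothetical callable mapping an embedding of
    shape (1, d_model) to the scalar we want to maximise (e.g. one neuron).
    """
    emb = torch.randn(1, d_model, requires_grad=True)
    opt = torch.optim.Adam([emb], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = -activation_of_interest(emb)  # maximise by minimising the negative
        loss.backward()
        opt.step()
    return emb.detach()

def nearest_discrete_inputs(emb, embedding_table, k=5):
    """Project an optimised vector onto the k closest rows of a (vocab, d_model)
    embedding table, i.e. back onto inputs the environment can actually produce."""
    dists = torch.cdist(emb, embedding_table)        # (1, vocab)
    return dists.topk(k, largest=False).indices.squeeze(0)
```

In the grid-world case the "vocabulary" would be the discrete object/colour/state codes that make up an observation, rather than word tokens.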
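And for item 3, a very rough sketch of the loop a procgen wrapper would need: step the environment, cache one layer's activations with a forward hook, and pause so they can be inspected (a real version would pause on user input and render a proper analysis panel rather than printing). It assumes `model` is a PyTorch module mapping a (1, 3, 64, 64) float image to action logits, and uses procgen's classic gym interface:

```python
import gym    # procgen registers its environments with classic gym
import torch

def play_with_live_analysis(model, layer, env_name="procgen:procgen-coinrun-v0",
                            pause_every=50):
    cache = {}
    # Forward hook caches the chosen layer's activations on every model call.
    handle = layer.register_forward_hook(
        lambda mod, inp, out: cache.update(acts=out.detach())
    )
    env = gym.make(env_name)
    obs = env.reset()                                 # (64, 64, 3) uint8 frame
    done, t = False, 0
    while not done:
        x = torch.as_tensor(obs, dtype=torch.float32).permute(2, 0, 1).unsqueeze(0) / 255.0
        with torch.no_grad():
            action = int(model(x).argmax(dim=-1))
        obs, reward, done, info = env.step(action)    # old 4-tuple gym API
        t += 1
        if t % pause_every == 0:
            print(f"step {t}: cached activations of shape {tuple(cache['acts'].shape)}")
            input("paused; press Enter to resume")    # stand-in for an interactive UI pause
    handle.remove()
```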
Feel free to ask questions here!
Thanks for your detailed project idea!!