Project Proposal: Understanding how AIs work by watching them think as they play video games. Needs Python developers, possibly C++.
I’d like to extend my current technical alignment project stack in one of a few non-trivial ways and would love help from more experienced software engineers to do it.
- Post: https://www.lesswrong.com/posts/bBuBDJBYHt39Q5zZy/decision-transformer-interpretability
- GitHub: https://github.com/jbloomAus/DecisionTransformerInterpretability
I’m not sure what the spread of technical proficiency is or how interested people are in assisting with my research agenda, but I’ve made a list of what I think are solid engineering challenges that I would love to get help with. Items 1 and 2 are things I can do or manage myself; item 3 is something I would need assistance with from someone with more experience.
1. Re-implementing bespoke grid worlds such as the AI Safety Gridworlds, proper mazes, or novel environments in currently maintained/compatible packages (Gymnasium and/or MiniGrid) to study alignment-relevant phenomena in RL agents/agent simulators.
2. Implementing methods for optimizing inputs (feature visualization) for PyTorch models/MiniGrid environments.
3. Developing a real-time mechanistic interpretability app for Procgen games (i.e., extending https://distill.pub/2020/understanding-rl-vision/#feature-visualization to game-time, interactive play with pausing). I have a Streamlit app that does this for grid worlds which I can demo.
Further Details:
1. The AI Safety Gridworlds suite (https://github.com/deepmind/ai-safety-gridworlds) is more than five years old and implemented in DeepMind’s pycolab engine (https://github.com/deepmind/pycolab). I’d love to study these environments with the current mechanistic interpretability techniques implemented in TransformerLens and the Decision Transformer Interpretability codebase; however, getting this all working will take time, so it would be great if people were interested in knocking that out (a minimal sketch of a custom MiniGrid environment follows after this list). Having proper mazes for agents to solve in MiniGrid would also be interesting, as a way to test our ability to reverse-engineer algorithms from models using current techniques.
2. Feature visualization techniques aren’t new, but they have previously been applied to continuous input spaces such as images fed to CNNs. However, recent work by Jessica Rumbelow (SolidGoldMagikarp post: https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation#Prompt_generation ) has shown that it’s possible to apply this technique to discrete spaces such as word embeddings. Extending this to the discrete environments we have been studying or might study (see 1) may provide valuable insights; a generic sketch of the approach follows after this list. Lucent (Lucid for PyTorch) may also be useful for this.
3. The current interactive analysis app for Decision Transformer Interpretability is written in Streamlit and so runs very slowly. This is fine for grid-world-type environments but won’t work for continuous, procedurally generated environments like Procgen (https://github.com/openai/procgen). Writing a Procgen/Python wrapper that provides live model analysis (with the ability to pause mid-game) will be crucial to further work; a rough sketch of such a wrapper is included below.
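For (1), here is a minimal sketch of what a bespoke grid world might look like in the maintained `minigrid`/Gymnasium stack. The class name and layout are purely illustrative and not part of any existing codebase; the point is that reimplementing environments mostly comes down to writing `_gen_grid` methods (plus any custom reward/termination logic) and registering them with Gymnasium.

```python
# Minimal sketch of a custom MiniGrid environment (assumes the maintained
# `minigrid` package, formerly gym-minigrid). The layout is illustrative:
# a lava strip forces the agent to route around a hazard to reach the goal.
from minigrid.core.grid import Grid
from minigrid.core.mission import MissionSpace
from minigrid.core.world_object import Goal, Lava
from minigrid.minigrid_env import MiniGridEnv


class SimpleHazardEnv(MiniGridEnv):
    def __init__(self, size=8, **kwargs):
        mission_space = MissionSpace(mission_func=lambda: "reach the green goal square")
        super().__init__(
            mission_space=mission_space,
            grid_size=size,
            max_steps=4 * size * size,
            **kwargs,
        )

    def _gen_grid(self, width, height):
        self.grid = Grid(width, height)
        self.grid.wall_rect(0, 0, width, height)      # outer walls
        for y in range(2, height - 2):
            self.grid.set(width // 2, y, Lava())       # hazard splitting the room
        self.put_obj(Goal(), width - 2, height - 2)    # goal in the far corner
        self.agent_pos = (1, 1)
        self.agent_dir = 0
        self.mission = "reach the green goal square"
```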
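For (2), here is a generic sketch of gradient-based input optimisation over a discrete space using a softmax relaxation of the tokens (Gumbel-softmax would be the obvious alternative). `embed` stands for the model’s embedding layer, and `forward_from_embeddings` is a placeholder for however the model under study exposes a forward pass from embeddings; neither is an existing function in the repo.

```python
# Sketch: feature visualisation over a discrete input space via a softmax
# relaxation. `embed` is assumed to be an nn.Embedding and
# `forward_from_embeddings` a callable that takes embeddings of shape
# (1, seq_len, d_model) and returns logits of shape (1, d_out).
import torch
import torch.nn.functional as F


def optimise_discrete_input(embed, forward_from_embeddings, seq_len,
                            target_index, steps=200, lr=0.1):
    vocab_size = embed.weight.shape[0]
    # Unconstrained logits over the vocabulary at each position; softmax gives
    # a differentiable "soft one-hot" stand-in for hard discrete tokens.
    input_logits = torch.randn(1, seq_len, vocab_size, requires_grad=True)
    optimiser = torch.optim.Adam([input_logits], lr=lr)
    for _ in range(steps):
        optimiser.zero_grad()
        soft_tokens = F.softmax(input_logits, dim=-1)       # (1, seq_len, vocab)
        soft_embeddings = soft_tokens @ embed.weight         # (1, seq_len, d_model)
        logits = forward_from_embeddings(soft_embeddings)    # (1, d_out)
        loss = -logits[0, target_index]                      # maximise the target logit
        loss.backward()
        optimiser.step()
    return input_logits.argmax(dim=-1)                       # hard-decode the optimised input
```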
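For (3), here is a rough sketch of the kind of wrapper I have in mind: cache activations from a PyTorch policy on every environment step so a front end can pause mid-game and render them. It assumes a Gymnasium-compatible environment with array observations and an `nn.Module` policy; the real app would add the pause/render layer on top of this.

```python
# Sketch: a Gymnasium wrapper that caches the policy's intermediate activations
# on every step, so a front end can pause mid-game and inspect them.
# `module_names` selects which submodules of the (assumed PyTorch) policy to hook.
import gymnasium as gym
import torch


class LiveAnalysisWrapper(gym.Wrapper):
    def __init__(self, env, policy, module_names):
        super().__init__(env)
        self.policy = policy
        self.cache = {}
        for name, module in policy.named_modules():
            if name in module_names:
                module.register_forward_hook(self._make_hook(name))

    def _make_hook(self, name):
        def hook(module, inputs, output):
            self.cache[name] = output.detach().cpu()    # keep a copy for the UI
        return hook

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        with torch.no_grad():
            obs_tensor = torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0)
            self.policy(obs_tensor)                      # forward pass fills self.cache
        info["activations"] = dict(self.cache)           # exposed for display / pausing
        return obs, reward, terminated, truncated, info
```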
Feel free to ask questions here!
Joseph Bloom
Open Source Contributions to AI Safety (option 2)
Open Source Contributions to AI Safety (option 1)
Hi all, I’m Joseph. I have a double degree in computational biology and statistics, have worked as an RA in protein engineering (structure/dynamics), and have worked in proteomics (LC-MS). I’m currently on an FTX regrant to find ways to help with biosecurity/AI. I’m focusing on upskilling in AI, but keen to keep discussing biosecurity.
Thanks so much! Was hoping someone would do this soon!
Hi!
It’s great that you are trying to make these kinds of decisions with impact in mind!
I have a comp bio background, but more in proteomics, and I have spent time this year looking at different ways to have a large impact, although my own focus was much more on pandemic preparedness / x-risk.
This problem is probably underspecified; it’s just very hard for anyone at this level of abstraction to make the decision for you. Details like your relationship with the supervisor or with other lab members, for example, could be critical. It does sound like you prefer the first option, but I’d encourage you to test your hypothesis thoroughly (or proportionally to the subsequent time investment).
However, some guiding principles may help:
Speak to current lab members/students of either lab. If you feel very confident that they are sending out good cultural and intellectual vibes, then your time is a much safer bet there.
Field-wise, metagenomics seems likely to be very useful in pandemic preparedness (see SecureDNA), so if your work has a higher inner product (more in common) with those kinds of projects, then I’d see that as a concretely safer bet.
Given your other interests, I’d definitely go speak to more bio experts. Book an appointment through the EA “consult a bio expert” service (if it’s still open), or look for EAs you could chat with and contact them.
EffectiveThesis might have some useful content too. https://effectivethesis.org/
Good luck and all the best!
Thanks Agustin,
I appreciate the clarification and this kind of detail (“people with experience working on climate change research, activism or public policy” as opposed to others).
Based on this thread, I think we’d be looking for a document that meets the following criteria:
Extends/Summarises current EA material on climate change so that it’s clear that EA has made serious attempts to assess it.
A nuanced explanation of the ITN framework, explaining how much of the work on climate change is not neglected, and which observations might justify working on climate change over other cause areas.
Some description of other EA cause areas and links to similar reasoning which may explain why they are prioritised by some EAs.
Such a document should also be simple enough to be linked as introductory material to someone not familiar with EA. It would also be valuable to test such a document/set of arguments on some climate activists or even iterate based on their feedback in order to be more effective.
I’m definitely not the person to write this, but I could ask around a few places to see if anyone is keen to work on it. It sounds like our prior is that this is likely enough to be valuable, and simple enough to attempt, that it’s worth a shot.
That’s fair. I’ll keep thinking about it but this was helpful, thanks.
My general sense of the 80k handbook is that it is very careful to emphasise uncertainty and leaves room for people to project existing beliefs without updating.
For example:
Working on this issue seems to be among the best ways of improving the long-term future we know of, but all else equal, we think it’s less pressing than our highest priority areas.
I value the integrity that 80k has here, but I think something shorter, with more direct comparisons to other cause areas, might be more effective.
Thanks Vael!
Thanks for the answer. Does this idea of looking at it in that hypothetical-world framing have a related post somewhere?
[Question] Does EA need better explanations of why Climate Change isn’t so popular a cause area?
This is fantastic, thank you!
Is there a summary of the main insights/common threads from the interviews?
I’d like to see a more comprehensive model for what biosecurity risk looks like that can motivate a comparison of project ideas. In the absence of that, it’s really hard to say where we get the most benefit.
This project would be valuable if the benefits outweighed the costs.
It could be relatively expensive (in person-hours) to run (there might be a tonne of publications to vet!) and relies on us being good (low false positives, high recall) at identifying biohazards (my prior is that this is actually pretty hard, and that those biohazardous publications would happen anyway). We’d also need to worry about incentivising people to make it harder to tell that their work is dangerous. Biohazards are bad, but preventing biohazards might have low marginal returns when some already exist. It’s not that any new biohazard is fine; it’s that marginal biohazards (ones that advance what’s possible) might be pretty rare relative to “acceptable”, non-marginal biohazards (i.e., another bad genome for something as bad as what’s already public knowledge). Other work might advance what’s possible without being a biohazard per se (e.g., AlphaFold).
I think a way to verify if this is a good project might be to talk to the Spiez lab. They run a biosecurity conference every year and invite anyone doing work that could be dangerous to attend.
I’m happy to chat more about it.
Thanks, Geoffrey, I appreciate the response.
It was definitely not my goal to describe how experienced people might “unlearn what they have learned”, but I’m not sure that much of the advice changes for experienced people.
“Unlearning” seems instrumentally useful if it makes it easier for you to contribute/think well but using your previous experience might also be valuable. For example, REFINE thinks that conceptual research is not varied enough and is looking for people with diverse backgrounds.
For example, apart from young adults often starting with the same few bad ideas about AI alignment, established researchers from particular fields might often start with their own distinctive bad ideas about AI alignment—but those might be quite field-dependent. For example, psych professors like me might have different failure modes in learning about AI safety than economics professors, or moral philosophy professors.
This is a good example and I think generally I haven’t addressed that failure mode in this article. I’m not aware of any resources for mid or late-career professionals transitioning into alignment but I will comment here if I hear of such a resource, or someone else might suggest a link.
I recently spoke to an applied research engineer at DeepMind who I could put you in touch with. My understanding is that you could probably make better contributions to minimising AI x-risk elsewhere, unless you are directly involved with the AI safety team. This is highly dependent on the details of your other potential avenues for contribution and on the exact role; for example, if you end up working very closely with the AI safety team, then this would be a more valuable role than if you were working elsewhere in DeepMind.
Feel free to message me and I’ll connect you.
(My suggestions) On Beginner Steps in AI Alignment
I’m not involved with running this course, but I’ve watched the online lectures and there’s a decent amount of content, albeit at a high level. If the course is run with rolling cohorts, then the inconvenience from the short notice is offset by being able to participate in or facilitate a later cohort.
Personally, I think developing courses while running them is a good way to make sure you’re creating value and updating based on feedback as opposed to putting in too much effort before testing your ideas.
As an author on this post, I think this is a surprisingly good summary. Some notes:
While all of the features are fictional, the more realistic ones are not far from reality. We’ve seen scripture features of various kinds in real models; a “scripture intersected with Monty Python” feature just wouldn’t be that surprising.
Some of the other features were more about tying in interesting structure in reality than anything else (e.g., the criticism-of-criticism feature).
In terms of the absurdities of feature interpretation, I think the idea was to highlight awareness of possible flaws, like buying into overly complicated stories we could tell if we work too hard to explain our results. We’re not sure what we’re doing yet in this pre-paradigmatic science, so having a healthy dose of self-awareness is important!