Project Proposal: Understanding how AIs work by watching them think as they play video games. Needs Python developers, possibly C++.
I’d like to extend my current technical alignment project stack in one of a few non-trivial ways and would love help from more experienced software engineers to do it.
- Post: https://www.lesswrong.com/posts/bBuBDJBYHt39Q5zZy/decision-transformer-interpretability
- GitHub: https://github.com/jbloomAus/DecisionTransformerInterpretability
I’m not sure what the spread of technical proficiency is or how interested people are in assisting with my research agenda, but I’ve made a list of what I think are solid engineering challenges that I would love help with. Items 1 and 2 are things I can do/manage myself; item 3 is something I would need assistance with from someone more experienced.
1. Re-implementing bespoke grid worlds such as the AI Safety Gridworlds, proper mazes, or novel environments in currently maintained/compatible packages (gymnasium and/or Minigrid) to study alignment-relevant phenomena in RL agents/agent simulators.
2. Implementing methods for optimizing inputs (feature visualization) for PyTorch models/MiniGrid environments.
3. Develop a real-time mechanistic interpretability app for procgen games (i.e., extend https://distill.pub/2020/understanding-rl-vision/#feature-visualization to game-time, interactive play with pausing). I have a Streamlit app that does this for gridworlds, which I can demo.
Further Details:
1. The AI Safety Gridworlds suite (https://github.com/deepmind/ai-safety-gridworlds) is more than five years old and is implemented in DeepMind’s pycolab engine (https://github.com/deepmind/pycolab). I’d love to study these environments with the current mechanistic interpretability techniques implemented in TransformerLens and the Decision Transformer Interpretability codebase; however, getting this all working will take time, so it would be great if people were interested in smashing that out. Having proper mazes for agents to solve in MiniGrid would also be interesting, in order to test our ability to reverse-engineer algorithms from models using current techniques.
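To give a flavour of the engineering involved, here’s a minimal sketch of what one of these re-implementations might look like as a custom MiniGrid environment. It assumes the current `minigrid`/`gymnasium` packages (whose APIs change between versions), and the class name and layout are illustrative rather than a faithful port of any specific safety gridworld:

```python
from minigrid.core.grid import Grid
from minigrid.core.mission import MissionSpace
from minigrid.core.world_object import Goal
from minigrid.minigrid_env import MiniGridEnv


class SafetyGridEnv(MiniGridEnv):
    """Illustrative stand-in for a re-implemented AI Safety Gridworld."""

    def __init__(self, size=7, **kwargs):
        mission_space = MissionSpace(mission_func=lambda: "reach the goal")
        super().__init__(
            mission_space=mission_space,
            grid_size=size,
            max_steps=4 * size * size,
            **kwargs,
        )

    def _gen_grid(self, width, height):
        # Build an empty room with surrounding walls, then place the agent and goal.
        # Safety-relevant objects (interruption tiles, side-effect objects, etc.)
        # would be added here.
        self.grid = Grid(width, height)
        self.grid.wall_rect(0, 0, width, height)
        self.agent_pos = (1, 1)
        self.agent_dir = 0
        self.put_obj(Goal(), width - 2, height - 2)
        self.mission = "reach the goal"
```

Once wrapped as a standard gymnasium environment, the agents and analysis tooling in the existing codebase should apply with relatively little modification.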
2. Feature visualization techniques aren’t new, but they have previously been used on continuous input spaces like the image inputs to CNNs. However, recent work by Jessica Rumbelow (the SolidGoldMagikarp post: https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation#Prompt_generation ) has shown that it’s possible to perform this technique on discrete spaces such as word embeddings. Extending this to the discrete environments we have been studying or might study (see 1) may provide valuable insights. Lucent (Lucid for PyTorch) may also be useful for this.
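As a rough sketch of what this could look like for a discrete space, the snippet below optimizes a softmax relaxation over tokens so that the resulting “soft” embedding maximizes a chosen internal activation, then snaps back to the nearest discrete tokens. `model`, `target_activation` and the shapes are hypothetical stand-ins, not part of the existing codebase:

```python
import torch


def optimise_discrete_input(model, embedding_matrix, target_activation,
                            seq_len=8, steps=200, lr=0.1):
    """Feature visualization over a discrete vocabulary via a softmax relaxation."""
    vocab_size, d_model = embedding_matrix.shape
    # Learnable logits over the vocabulary for each position.
    logits = torch.zeros(seq_len, vocab_size, requires_grad=True)
    optimiser = torch.optim.Adam([logits], lr=lr)

    for _ in range(steps):
        optimiser.zero_grad()
        probs = torch.softmax(logits, dim=-1)       # (seq_len, vocab_size)
        soft_embeds = probs @ embedding_matrix      # (seq_len, d_model)
        # target_activation is assumed to run the model on the soft embeddings
        # and return the scalar activation we want to maximise.
        loss = -target_activation(model, soft_embeds.unsqueeze(0))
        loss.backward()
        optimiser.step()

    # Discretise: take the most likely token at each position.
    return logits.argmax(dim=-1)
```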
3. The current interactive analysis app for Decision Transformer Interpretability is written in Streamlit and so runs very slowly. This is fine for gridworld-type environments but won’t work for continuous procedurally generated environments like procgen (https://github.com/openai/procgen). Writing a procgen/Python wrapper that provides live model analysis (with the ability to pause mid-game) will be crucial to further work.
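To sketch the kind of wrapper I mean (this is not an existing API): a gym `Wrapper` that registers PyTorch forward hooks on the policy, caches activations every step so a front-end can render them live, and exposes a pause flag so the UI can stop play mid-game for inspection. The environment id, `policy` object and layer names below are placeholders:

```python
import gym


class LiveAnalysisWrapper(gym.Wrapper):
    """Caches policy activations each step so they can be inspected mid-game."""

    def __init__(self, env, policy, layer_names):
        super().__init__(env)
        self.policy = policy
        self.activations = {}
        self.paused = False
        wanted = set(layer_names)
        for name, module in policy.named_modules():
            if name in wanted:
                module.register_forward_hook(self._make_hook(name))

    def _make_hook(self, name):
        def hook(module, inputs, output):
            # Detach and move to CPU so the front-end can read activations safely.
            self.activations[name] = output.detach().cpu()
        return hook

    def toggle_pause(self):
        # The front-end checks this flag and stops calling step() while paused,
        # leaving the most recent activations available for inspection.
        self.paused = not self.paused

    def step(self, action):
        return self.env.step(action)


# Hypothetical usage with a procgen environment:
# env = LiveAnalysisWrapper(gym.make("procgen:procgen-coinrun-v0"),
#                           policy, layer_names=["conv_stack.2"])
```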
Feel free to ask questions here!
Joseph Bloom
This project would be valuable if the benefits outweighed the costs.
It could be relatively expensive (in person-hours) to run (there might be a tonne of publications to vet!) and relies on us being good (low false-positive rate, high recall) at identifying biohazards (my prior is that this is actually pretty hard, and that those biohazardous publications would happen anyway). We’d also need to worry about incentivising people to make it harder to tell that their work is dangerous.

Biohazards are bad, but preventing biohazards might have low marginal returns when some already exist. It’s not that any new biohazard is fine; it’s that marginal biohazards (ones that advance what’s possible) might be pretty rare relative to “acceptable”, non-marginal biohazards (i.e., another bad genome for something roughly as bad as what’s already public knowledge). Other work might advance what’s possible without being a biohazard per se (e.g., AlphaFold).
I think a way to verify if this is a good project might be to talk to the Spiez lab. They run a biosecurity conference every year and invite anyone doing work that could be dangerous to attend.
I’m happy to chat more about it.
Thanks so much! Was hoping someone would do this soon!
Hi!
It’s great that you are trying to make these kinds of decisions with impact in mind!
I have a comp bio background, but more in proteomics, and I have spent time this year looking at different ways to have a large impact, although my own focus was much more on pandemic preparedness / x-risk.
This problem is probably underspecified; it’s just very hard for anyone at this level of abstraction to make the decision for you. Details like your relationship with the supervisor or the other lab members, for example, could be critical. It does sound like you prefer the first option, but I’d encourage you to test your hypothesis thoroughly (or at least proportionally to the subsequent time investment).
However, some guiding principles may help:
Speak to current lab members/students of either lab. If you feel very confident that they are sending out good cultural and intellectual vibes, then your time is a much safer bet there.
Field-wise, metagenomics seems likely to be very useful in pandemic preparedness (see SecureDNA), so if your work has a higher inner product (more in common) with those kinds of projects, then I’d see that as a concretely safer bet.
Given your other interests, I’d definitely go speak to more bio experts. Book an appointment via the EA “consult a bio expert” service (if it’s still open), or look for EAs you could chat with and contact them.
EffectiveThesis might have some useful content too. https://effectivethesis.org/
Good luck and all the best!
Hi there,
Interesting idea. I think there’s a lot of possible commentary or answers, so I’ll provide some quick thoughts based on ~1 month of reading/upskilling in AI/bio-related x-risk, which I began after receiving an FTX Future Fund regrant.
Does anyone have any advice before I start this project?
Do experiments. These can inform your estimates of how valuable a podcast would be to others, how useful it would be to you and how much effort it would require. This post is a great experiment also, so kudos!
In particular, are there any resources you recommend for teaching myself about machine learning, genomics, or politics?
There are lots of different materials online for learning about these general topics. I would highly suggest you start by getting a thorough understanding of the relevant x-risk cause areas without diving into technical details, and then learn about these technical topics if/when they appear most relevant.
I’m interested in whether this particular piece of advice in the previous paragraph is contentious (the other perspective being “go learn lots of general skills before getting more context on x-risks”). Still, I think that might be a costly approach, involving lots of time spent learning extraneous detail with no apparent payoff.
For AI:
I think the best place to start is the Cambridge AGI Safety Fundamentals course (which has technical and governance variants). You don’t need a lot of deep learning expertise to do the course, and the materials are available online, so you can work through them before they next run a cohort.
For Bio:
Tessa curated A Biosecurity and Biorisk Reading+ List, which covers several domains, including genomics.
And are there any hidden risks I’m not considering that might make this idea worse than it seems?
Other than not achieving your goals or being costly (both mitigated by starting small and doing experiments), the most significant potential risk is some kind of information hazard. If you focus on prerequisite skills first, info hazards are less likely to come up. There are also dangers in being too careful around info hazards, so perhaps the best approach is to share episodes with a small group of info-hazard-aware community members first as a check.
Good luck! And please feel free to reach out if you’d like to discuss this further.
That’s fair. I’ll keep thinking about it but this was helpful, thanks.
My general sense of the 80k handbook is that it is very careful to emphasise uncertainty and leaves room for people to project existing beliefs without updating.
For example:
Working on this issue seems to be among the best ways of improving the long-term future we know of, but all else equal, we think it’s less pressing than our highest priority areas.
I value the integrity that 80k has here, but I think something shorter, with more direct comparisons to other cause areas, might be more effective.
As an author on this post, I think this is a surprisingly good summary. Some notes:
While all of the features are fictional, the more realistic ones are not far from reality. We’ve seen scripture features of various kinds in real models. A scripture intersect Monty Python feature just wouldn’t be that surprising.
Some of the other features were more about pointing at interesting structure in reality than anything else (e.g. the criticism-of-criticism feature).
In terms of the absurdities of feature interpretation, I think the idea was to highlight awareness of possible flaws, like buying into overly complicated stories we could tell ourselves if we work too hard to explain our results. We’re not sure what we’re doing yet in this pre-paradigmatic science, so having a healthy dose of self-awareness is important!
Thanks Vael!
Thanks for the answer. Does this idea of looking at it in that hypothetical world framing have a related post somewhere?
This is fantastic, thank you!
Is there a summary of the main insights/common threads from the interviews?
I’d like to see a more comprehensive model for what biosecurity risk looks like that can motivate a comparison of project ideas. In the absence of that, it’s really hard to say where we get the most benefit.
Thanks, Geoffrey, I appreciate the response.
It was definitely not my goal to describe how experienced people might “unlearn what they have learned”, but I’m not sure that much of the advice changes for experienced people.
“Unlearning” seems instrumentally useful if it makes it easier for you to contribute/think well but using your previous experience might also be valuable. For example, REFINE thinks that conceptual research is not varied enough and is looking for people with diverse backgrounds.
For example, apart from young adults often starting with the same few bad ideas about AI alignment, established researchers from particular fields might often start with their own distinctive bad ideas about AI alignment—but those might be quite field-dependent. For example, psych professors like me might have different failure modes in learning about AI safety than economics professors, or moral philosophy professors.
This is a good example and I think generally I haven’t addressed that failure mode in this article. I’m not aware of any resources for mid or late-career professionals transitioning into alignment but I will comment here if I hear of such a resource, or someone else might suggest a link.
I’m not involved with running this course but I’ve watched the online lectures and there’s a decent amount of content, albeit at a high level. If the course is run with rolling cohorts then the inconvenience from the short notice is offset by being able to participate or facilitate a later cohort.
Personally, I think developing courses while running them is a good way to make sure you’re creating value and updating based on feedback as opposed to putting in too much effort before testing your ideas.
Sorry for the slow reply.
Talking about the allocation of EAs to cause areas.
I agree that confidence intervals between x-risks are more likely to overlap. I haven’t really looked into super-volcanoes or asteroids and I think that’s because what I know about them currently doesn’t lead me to believe they’re worth working on over AI or Biosecurity.
Possibly, a suitable algorithm would be to defer to/check with prominent EA organisations like 80k to see if they are allocating 1 in every 100 or every 1000 EAs to rare but possibly important x-risks. Without a coordinated effort by a central body, I don’t see how you’d calibrate adequately (use a random number generator and if the number is less than some number, work on a neglected but possibly important cause?).
My thoughts on EA allocation to cause areas have evolved quite a bit recently (partly due to talking to 80k and others, mainly in biosecurity). I’ll probably write a post with my thoughts, but the bottom line is that the sentiment expressed here is basically correct, and that it’s socially easier to have humility in the form of saying you have high uncertainty.
Responding to the spirit of the original post, my general sense is that plenty of people are not highly uncertain about AI-related x-risk; you might have gotten that email from 80k titled “A huge update to our problem profile — why we care so much about AI risk”. That being said, they’re still using phrases like “we’re very uncertain”. Maybe their uncertainty about the relevant facts just isn’t large enough to change their decision rule. For example, in the problem profile, they write: “Overall, our current take is that AI development poses a bigger threat to humanity’s long-term flourishing than any other issue we know of.”
Different Views under Near-Termism
If you don’t buy longtermism, you probably still care about x-risks, but your rejection of longtermism massively affects the relative importance of x-risks compared to nearterm problems, which affects cause prioritisation.
This seems tempting to believe, but I think we should substantiate it. Which current x-risks are not ranked higher than non-x-risk causes (or how much smaller is their lead) from a near-term perspective?
I think this post gives a somewhat detailed summary of how your views may change when moving from a long-termist to a near-termist framing. Scott says:
Does Long-Termism Ever Come Up With Different Conclusions Than Thoughtful Short-Termism?
I think yes, but pretty rarely, in ways that rarely affect real practice.
His arguments here are convincing to me because I find an AGI event this century likely; if you didn’t, then you would disagree. Still, I think that even if AI didn’t have short timelines, other existential risks like engineered pandemics, super-volcanoes or asteroids might have milder, merely catastrophic variants, which near-termists would prioritise just as highly, leading to little practical variation in what people work on.
Talking about different cultures and EA
Similarly, I don’t expect diversity of thought to introduce entirely new causes to EA or lead to current causes being entirely abandoned, but I do expect it to affect cause prioritisation.
I don’t entirely understand what East Asian cultures mean by balance/harmony, so I can’t tell how it would affect cause prioritisation; I just think there would be an effect.
Can you reason out how “there would be an effect”?
Thanks for clarifying.
So I’m an example of someone in that position (I’m trying to work out how to contribute via direct work to a cause area) so I appreciate the opportunity to discuss the topic.
Upon reflection, maybe the crux of my disagreement here is that I just don’t agree that the uncertainty is wide enough to affect the rankings (except within each tier) or to make the direct-work decision rule robust to personal fit.
I think that x-risks have non-overlapping confidence intervals with non-x-risks because of the scale of the problem, and I don’t feel like this changes from a near-term perspective. Even small chances of major catastrophic events this century seem to dwarf other problems.
80k’s second-highest priority areas are nuclear security, climate change (extreme risks) and improving institutional decision-making. The first two seem to be associated with major catastrophes (maybe not x-risks), which also might be considered not to overlap with the next set of issues (factory farming/global health).
With respect to concerns that demographics might be heavily affecting cause prioritisation, I think it would be helpful to have specific examples of causes you think are under-estimated and the biases associated with them.
For example, I’ve heard lots of different arguments that x-risks are concerning even if you don’t buy into long-termism. To a similar end, I can’t think of any causes that would be under-valued because of not caring adequately about balance/harmony.
Interesting post; I’m trying to understand it better. I think the cause area sounds good, but I don’t feel confident that there’s a huge amount of free energy lying around (thinking in terms of Inadequate Equilibria). I feel like the heart of the argument is that a “cultural” shift akin to what helped SpaceX succeed could solve similar problems in biopharma R&D.
A detailed plan for a trial EA project would help establish the tractability more concretely.
Some specific points I’d like to clarify:
What exactly are the “emergent properties of complex systems” mentioned in the neglectedness section? It sounds like perverse incentives, unexplained lack of translation of technological progress to profitability and maybe backfiring legislation?
“The seeds of this are most likely to be found within younger, more innovative for-profit companies, so that is where EA should direct its resources.”—Can you point to any examples of such companies not succeeding for lack of funding? I think this could be a really strong point if you can point to SpaceX equivalents in biopharma that didn’t get off the ground for lack of “a tonne of private capital and public subsidies”.
“This essay advocates here a more distributed effort, one befitting the EA community.”—This sounds potentially very time-expensive. If money isn’t the limiting resource, EA community members’ time/focus might be. What is your sense of the number of people and amount of time it would take to test this hypothesis?
I know a SpaceX engineer who also attributes their success to culture. I’m willing to accept this as a plausible catalyst for radical progress (although exactly which cultural hallmarks you mean might need to be specified).
I think if EAs better appreciated uncertainty when prioritising causes, people’s careers would span a wider range of cause areas.
I’ve got a strong intuition that this is wrong, so I’m trying to think it through.
To argue that EAs underestimate uncertainty, you need to directly observe their uncertainty estimates (and know the correct level of uncertainty to have). For example, if the community were homogeneous and everyone assigned a 1% chance to Cause X being the most important issue (I’m deliberately trying not to deal with how to measure this) and a 99% chance to Cause Y being the most important issue, then all individuals would choose to work on Cause Y. If the probabilities were 5% X and 95% Y, you’d get the same outcome. This is because individuals are making single choices.
Now, if there were a central body coordinating everyone’s efforts, in the first scenario it still wouldn’t follow that 1% of people would get allocated to Cause X. Optimal allocation strategy aside, there isn’t a clean relationship between uncertainty and decision rules.
I think 80 000 Hours could emphasise uncertainty more, but also that the EA community as a whole just needs to be more conscious of uncertainty in cause prioritisation.
I think 80k is already very conscious of this (based on my general sense of 80k’s materials). Global priorities research is one of their four highest-priority areas, and it’s precisely about getting more confidence about what the top priorities are.
I think something that would help me understand where you are coming from would be to hear more about what you think the decision rules are for most individuals, how they are taking their uncertainty into account, and more precisely how gender/culture interacts with cause-area uncertainty in shaping decisions.
A bit late here but I was looking into it and found this (https://survivalandflourishing.fund/s-process):
Hi all, I’m Joseph. I have a double degree in computational biology/statistics, have RA’d in protein engineering (structure/dynamics) and worked in proteomics (LCMS). Currently on an FTX regrant to find ways to help with Biosecurity/AI. I’m focusing on upskilling in AI, but keen to keep discussing biosecurity.