Here’s more info on Open Phil’s 4 requests for proposals for certain areas of technical AI safety work, copied from the Alignment Newsletter.
Request for proposals for projects in AI alignment that work with deep learning systems(Nick Beckstead and Asya Bergal) (summarized by Rohin): Open Philanthropy is seeking proposals for AI safety work in four major areas related to deep learning, each of which I summarize below. Proposals are due January 10, and can seek up to $1M covering up to 2 years. Grantees may later be invited to apply for larger and longer grants.
Rohin’s opinion: Overall, I like these four directions and am excited to see what comes out of them! I’ll comment on specific directions below.
RFP: Measuring and forecasting risks(Jacob Steinhardt) (summarized by Rohin): Measurement and forecasting is useful for two reasons. First, it gives us empirical data that can improve our understanding and spur progress. Second, it can allow us to quantitatively compare the safety performance of different systems, which could enable the creation of safety standards. So what makes for a good measurement?
1. Relevance to AI alignment: The measurement exhibits a failure mode that becomes worse as models become larger, or tracks a potential capability that may emerge with further scale (which in turn could enable deception, hacking, resource acquisition, etc).
2. Forward-looking: The measurement helps us understand future issues, not just those that exist today. Isolated examples of a phenomenon are good if we have nothing else, but we’d much prefer to have a systematic understanding of when a phenomenon occurs and how it tends to quantitatively increase or decrease with various factors. See for example scaling laws (AN #87).
3. Rich data source: Not all trends in MNIST generalize to CIFAR-10, and not all trends in CIFAR-10 generalize to ImageNet. Measurements on data sources with rich factors of variation are more likely to give general insights.
4. Soundness and quality: This is a general category for things like “do we know that the signal isn’t overwhelmed by the noise” and “are there any reasons that the measurement might produce false positives or false negatives”.
What sorts of things might you measure?
1. As you scale up task complexity, how much do you need to scale up human-labeled data to continue to maintain good performance and avoid reward hacking? If you fail at this and there are imperfections in the reward, how bad does this become?
2. What changes do we observe based on changes in the quality of the human feedback (e.g. getting feedback from amateurs vs experts)? This could give us information about the acceptable “difference in intelligence” between a model and its supervisor.
3. What happens when models are pushed out of distribution along a factor of variation that was not varied in the pretraining data?
4. To what extent do models provide wrong or undesired outputs in contexts where they are capable of providing the right answer?
Rohin’s opinion: Measurements generally seem great. One story for impact is that we have a measurement that we think is strongly correlated with x-risk, and we use that measurement to select an AI system that scores low on such a metric. This seems distinctly good and I think would in fact reduce x-risk! But I want to clarify that I don’t think it would convince me that the system was safe with high confidence. The conceptual arguments against high confidence in safety seem quite strong and not easily overcome by such measurements. (I’m thinking of objective robustness failures (AN #66) of the form “the model is trying to pursue a simple proxy, but behaves well on the training distribution until it can execute a treacherous turn”.)
You can also tell stories where the measurements reveal empirical facts that then help us have high confidence in safety, by allowing us to build better theories and arguments, which can rule out the conceptual arguments above.
Separately, these measurements are also useful as a form of legible evidence about risk to others who are more skeptical of conceptual arguments.
1. Train a language model to accurately explain things about a field that the feedback providers are not familiar with.
2. Train an RL agent to act well in an environment where the RL agent can observe more information than the feedback providers can.
3. Train a multilingual model to translate between English and a foreign language that the feedback providers do not know.
RFP: Interpretability(Chris Olah) (summarized by Rohin): The author provides this one sentence summary: We would like to see research building towards the ability to “reverse engineer” trained neural networks into human-understandable algorithms, enabling auditors to catch unanticipated safety problems in these models.
This RFP is primarily focused on an aspirational “intermediate” goal: to fully reverse engineer some modern neural network, such as an ImageNet classifier. (Despite the ambition, it is only an “intermediate” goal because what we would eventually need is a general method for cheaply reverse engineering any neural network.) The proposed areas of research are primarily inspired by the Circuits line of work (AN #142):
1. Discovering Features and Circuits: This is the most obvious approach to the aspirational goal. We simply “turn the crank” using existing tools to study new features and circuits, and this fairly often yields an interesting result that makes progress towards reverse engineering a neural network.
2. Scaling Circuits to Larger Models: So far the largest example of reverse engineering is curve circuits, with 50K parameters. Can we find examples of structure in the neural networks that allow us to drastically reduce the amount of effort required per parameter? (As examples, see equivariance and branch specialization.)
3. Resolving Polysemanticity: One of the core building blocks of the circuits approach is to identify a neuron with a concept, so that connections between neurons can be analyzed as connections between concepts. Unfortunately, some neurons are polysemantic, that is, they encode multiple different concepts. This greatly complicates analysis of the connections and circuits between these neurons. How can we deal with this potential obstacle?
Rohin’s opinion: The full RFP has many, many more points about these topics; it’s 8 pages of remarkably information-dense yet readable prose. If you’re at all interested in mechanistic interpretability, I recommend reading it in full.
This RFP also has the benefit of having the most obvious pathway to impact: if we understand what algorithm neural networks are running, there’s a much better chance that we can catch any problems that arise, especially ones in which the neural network is deliberately optimizing against us. It’s one of the few areas where nearly everyone agrees that further progress is especially valuable.
RFP: Truthful and honest AI(Owain Evans) (summarized by Rohin): This RFP outlines research projects on Truthful AI (summarized below). They fall under three main categories:
1. Increasing clarity about “truthfulness” and “honesty”. While there are some tentative definitions of these concepts, there is still more precision to be had: for example, how do we deal with statements with ambiguous meanings, or ones involving figurative language? What is the appropriate standard for robustly truthful AI? It seems too strong to require the AI system to never generate a false statement; for example it might misunderstand the meaning of a newly coined piece of jargon.
2. Creating benchmarks and tasks for Truthful AI, such as TruthfulQA (AN #165), which checks for imitative falsehoods. This is not just meant to create a metric to improve on; it may also simply perform as a measurement. For example, we could experimentally evaluate whether honesty generalizes (AN #158), or explore how much truthfulness is reduced when adding in a task-specific objective.
3. Improving the truthfulness of models, for example by finetuning models on curated datasets of truthful utterances, finetuning on human feedback, using debate (AN #5), etc.
Besides the societal benefits from truthful AI, building truthful AI systems can also help with AI alignment:
1. A truthful AI system can be used to supervise its own actions, by asking it whether its selected action was good.
2. A robustly truthful AI system could continue to do this after deployment, allowing for ongoing monitoring of the AI system.
3. Similarly, we could have a robustly truthful AI system supervise its own actions in hypothetical scenarios, to make it more robustly aligned.
Rohin’s opinion: While I agree that making AI systems truthful would then enable many alignment strategies, I’m actually more interested in the methods by which we make AI systems truthful. Many of the ideas suggested in the RFP are ones that would apply to alignment more generally and aren’t particularly specific to truthful AI. So it seems like whatever techniques we used to build truthful AI could then be repurposed for alignment. In other words, I expect that the benefit to AI alignment of working on truthful AI is that it serves as a good test case for methods that aim to impose constraints upon an AI system. In this sense, it is a more challenging, larger version of the ”never describe someone getting injured” challenge (AN #166). Note that I am only talking about how this helps AI alignment; there are also beneficial effects on society from pursuing truthful AI that I haven’t talked about here.
Here’s more info on Open Phil’s 4 requests for proposals for certain areas of technical AI safety work, copied from the Alignment Newsletter.