What is the role of Bayesian ML for AI alignment/safety?

I started a Ph.D. in Bayesian ML because I thought it was a relevant approach to AI safety. Currently, I think that this is unlikely to be the case and that other paths within ML are more promising.

The main purpose of this post is to spark a discussion, get feedback and collect promising paths within Bayesian ML. Before that, I want to give a very short overview of why I changed my mind.

If there are relevant projects in the space of Bayesian ML for AI safety, I would be very keen to hear about them. Please share them in the comments.

For this post, I’ll call Bayesian ML everything that uses Bayes’ theorem in the context of ML or that estimates probability distributions rather than point estimates.
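Concretely, the object of interest is the posterior over parameters or hypotheses θ given data D, rather than a single best guess:

$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}$$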

Why I changed my mind

I was bullish on Bayesian ML for AI safety because:

  1. The Bayesian framework is very powerful in general. It can be used to describe a lot of phenomena across disciplines, e.g. neuroscience, economics, and medicine. Thus, I thought that such a general framework might be scaled up to something AGI-like and that I should understand it better.

  2. Quantified uncertainty seemed relevant for AI alignment. This might take the form of quantifying what a model doesn’t know, or of implementing constraints, e.g. by choosing specific priors.

I changed my mind because:

  1. It currently looks like TAI will come from really large NNs. Making NNs Bayesian is usually quite computationally expensive, especially for large models. Furthermore, current Bayesian techniques still seem quite flawed, i.e. the quality of their uncertainty estimates isn’t that great; at least that’s the impression I’ve gotten after working with them for about a year.

  2. As a consequence, when a relevant advance in AI comes around, the Bayesian version is usually years behind. This is far too long for safety-critical applications and thus not very useful.

  3. I’m not sure quantified uncertainty matters that much for alignment. Assume you have an unaligned AGI and can quantify that misalignment in some way. This still doesn’t solve the underlying problem that the AGI is not aligned.

I still think that Bayesian ML might be really useful in other areas, and I hope that Bayesian techniques will be increasingly used for statistical analysis in economics, neuroscience, medicine, etc. I’m just not sure it matters for alignment.

Possibly relevant projects

This is a short list off the top of my head. I’m not an expert on most of these topics, so I might misrepresent them. Feel free to correct me and to add further ones.

“Knowing what we don’t know” & Out-of-distribution detection

There are a ton of papers that use Bayesian techniques to quantify predictive uncertainty in neural networks. One of my Ph.D. projects focuses on such a technique, and my impression is that they work OK-ish, at least in classification settings. However, many simple non-Bayesian techniques such as ensembles yield results of similar or even better quality. Thus, I’m not sure the Bayesian formalism adds much value, given that it is usually much more expensive.
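As a minimal sketch of the kind of comparison I have in mind (not a reproduction of my project), here is how MC dropout, as a cheap approximately-Bayesian method, and a deep ensemble produce predictive uncertainty for the same inputs. The architecture, data, and hyperparameters are placeholders, and the models are untrained; in practice you would train them and feed in real in-distribution and OOD inputs.

```python
# Sketch: predictive uncertainty via MC dropout (approximately Bayesian)
# vs. a deep ensemble (non-Bayesian baseline). Toy setup with random data.
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_classifier(in_dim=16, n_classes=3):
    return nn.Sequential(
        nn.Linear(in_dim, 64), nn.ReLU(), nn.Dropout(p=0.2),
        nn.Linear(64, n_classes),
    )

def predictive_entropy(probs):
    # probs: (batch, n_classes); higher entropy = more uncertain
    return -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)

x = torch.randn(8, 16)  # stand-in for (possibly OOD) inputs

# --- MC dropout: keep dropout active at test time, average T stochastic passes
model = make_classifier()
model.train()  # leaves dropout on
with torch.no_grad():
    mc_probs = torch.stack(
        [F.softmax(model(x), dim=-1) for _ in range(30)]
    ).mean(0)

# --- Deep ensemble: average predictions of independently initialized models
ensemble = [make_classifier() for _ in range(5)]
with torch.no_grad():
    ens_probs = torch.stack(
        [F.softmax(m.eval()(x), dim=-1) for m in ensemble]
    ).mean(0)

print("MC dropout entropy:", predictive_entropy(mc_probs))
print("Ensemble entropy:  ", predictive_entropy(ens_probs))
```

The point of the comparison is that the ensemble needs no Bayesian machinery at all, yet in practice its uncertainty estimates are often at least as well-calibrated.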

Reward uncertainty

In the context of RL, the reward is usually given as a point estimate. Using distributions over the reward instead of point estimates might make the agent more robust and thus less prone to errors. This might be a path to reducing misalignment.
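As a rough sketch of what this could look like, here is a toy Bayesian bandit that keeps a Gaussian posterior over each action’s mean reward (conjugate normal prior, known observation noise) and can then act greedily, pessimistically, or by posterior sampling. All numbers are invented for illustration.

```python
# Sketch: treating reward as a distribution instead of a point estimate.
# Toy Bayesian bandit with a Gaussian posterior over each action's mean reward.
import numpy as np

rng = np.random.default_rng(0)
n_actions, obs_noise = 3, 1.0
prior_mean, prior_var = np.zeros(n_actions), np.ones(n_actions) * 10.0

# Pretend we observed a few rewards per action.
observations = {0: [1.0, 0.8, 1.2], 1: [2.5], 2: [0.1, -0.2]}

post_mean, post_var = prior_mean.copy(), prior_var.copy()
for a, rewards in observations.items():
    n = len(rewards)
    post_var[a] = 1.0 / (1.0 / prior_var[a] + n / obs_noise)
    post_mean[a] = post_var[a] * (
        prior_mean[a] / prior_var[a] + np.sum(rewards) / obs_noise
    )

# A point-estimate agent just maximizes the mean; an uncertainty-aware agent
# can act pessimistically (lower confidence bound) or sample the posterior.
greedy = int(np.argmax(post_mean))
pessimistic = int(np.argmax(post_mean - 2.0 * np.sqrt(post_var)))
thompson = int(np.argmax(rng.normal(post_mean, np.sqrt(post_var))))
print(greedy, pessimistic, thompson)
```

The pessimistic agent discounts the action with a single lucky observation, which is the kind of conservatism that reward uncertainty is supposed to buy you.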

Learning from human preferences

To align AI systems, it is important to give them feedback on their actions or beliefs efficiently. Many current systems that implement learning from human preferences are either conceptually Bayesian or use probabilistic models such as Gaussian processes.
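To make the Bayesian flavor concrete, here is a toy treatment of pairwise comparisons: a Bradley-Terry likelihood over the latent utility difference between two options A and B, combined with a Gaussian prior and a grid approximation of the posterior. The preference data are invented, and this is not how any particular system implements it.

```python
# Sketch: a Bayesian take on learning from pairwise human preferences.
# Bradley-Terry likelihood over the latent utility difference d = u(A) - u(B).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Grid over the utility difference and a N(0, 1) prior on it.
d = np.linspace(-5, 5, 1001)
log_prior = -0.5 * d**2

# Human comparisons: 1 means "A preferred over B", 0 the opposite.
preferences = [1, 1, 0, 1]

log_lik = sum(
    np.log(sigmoid(d)) if y == 1 else np.log(sigmoid(-d)) for y in preferences
)
log_post = log_prior + log_lik
post = np.exp(log_post - log_post.max())
post /= post.sum()

print("P(A preferred):", float(post[d > 0].sum()))
print("Posterior mean of u(A) - u(B):", float((post * d).sum()))
```

The appeal is that the posterior tells you not just which option looks better but how confident you should be, which is exactly where active querying of humans can plug in.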

Constraining models

In the Bayesian framework, priors incorporate previous knowledge. We could model safety constraints as strong priors, e.g. by penalizing the model heavily for crossing a certain threshold. While this is nice in theory, it still doesn’t solve the problem of mapping the real world into mathematical constraints, i.e. we still need to map “don’t do the bad thing” into a probability distribution.
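A minimal sketch of what such a strong prior could look like in practice: treat training as MAP estimation and add a heavy negative log-prior on constraint violations. The probe inputs, threshold, and prior strength below are hypothetical stand-ins; choosing them so that they actually capture “don’t do the bad thing” is the unsolved part.

```python
# Sketch: encoding a "constraint" as a strong prior in a MAP objective.
# Toy constraint: the model's scalar output on some probe inputs should
# stay below a threshold; the prior puts steep negative log-density on violations.
import torch
import torch.nn as nn

model = nn.Linear(4, 1)
data_x, data_y = torch.randn(32, 4), torch.randn(32, 1)
probe_x = torch.randn(8, 4)        # inputs where the constraint must hold
threshold, prior_strength = 1.0, 100.0

opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for step in range(200):
    opt.zero_grad()
    nll = ((model(data_x) - data_y) ** 2).mean()               # data fit (Gaussian likelihood)
    violation = torch.relu(model(probe_x) - threshold)          # how far outputs exceed the threshold
    neg_log_prior = prior_strength * (violation ** 2).mean()    # strong prior against violations
    loss = nll + neg_log_prior                                  # MAP = neg. log-likelihood + neg. log-prior
    loss.backward()
    opt.step()
```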

Causality

Some people believe that causal modeling is the missing piece for AGI. Most causal modeling approaches use techniques that are related to Bayesian ML. I’m not sure yet how realistic causal modeling is at scale, but I think there are some insights that the AI safety community is currently overlooking. A post on this topic is in the making and will be published soon.

I need your help

I’m trying to pivot to projects more related to AI safety within my Ph.D. For this, I want to better understand which kinds of projects are relevant. If you have ideas, papers, blog posts, etc. that could be helpful, don’t hesitate to comment.