I’m sympathetic to the general thrust of the argument, that we should be reasonably optimistic about “business-as-usual” leading to successful narrow alignment. I put particular weight on the second argument, that the AI research community will identify and be successful at solving these problems.
However, you mostly lost me in the third argument. You suggest using whatever state-of-the-art general-purpose learning technique exists to model human values, and then optimising for the learned values. I’m pessimistic about this since it involves an adversarial relationship between the optimiser (e.g. an RL algorithm) and the learned reward function. This will work if the optimiser is weak and the reward model is strong. But if we are hypothesising a far improved reward learning technique, we should also assume far more powerful RL algorithms than we have today.
Currently, it seems like RL is generally an easier problem than learning a reward function. For example, current IRL algorithms will overfit the reward function to demonstrations in a high-dimensional environment. If you later optimize the reward with an RL algorithm, you get a policy which does well under the learned reward function, but terribly (often worse than random) on the ground truth reward function. This is why you normally learn the policy jointly with the reward in a GAN-based approach. Regularizers that let you learn a good reward model (one which can then generalize) are an active area of research, see e.g. the variational discriminator bottleneck. However, solving this in generality seems very hard. There’s been little success in adversarial defences, which is a related problem, and there are theoretical reasons to believe adversarial examples will be present for any model class in high-dimensional environments.
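To make the failure mode concrete, here is a toy sketch (everything in it is illustrative, not any real IRL algorithm): an over-expressive reward model is fit to demonstrations that only cover a narrow region of state space, and a strong optimiser searching the whole space then exploits the model’s behaviour far outside the demonstration distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ground-truth reward: peaks at s = 1 and falls off linearly.
def true_reward(s):
    return -np.abs(s - 1.0)

# Demonstrations only cover a narrow region around the optimum.
demo_states = rng.uniform(0.5, 1.5, size=10)
demo_labels = true_reward(demo_states) + rng.normal(0.0, 0.01, size=10)

# "Reward learning": an over-expressive model (a high-degree polynomial,
# standing in for an unregularized reward network) fit to few demonstrations.
coeffs = np.polyfit(demo_states, demo_labels, deg=7)
learned_reward = np.poly1d(coeffs)

# "RL": a strong optimiser that searches the full state space,
# not just the region the demonstrations covered.
candidates = np.linspace(-10.0, 10.0, 2001)
s_star = candidates[np.argmax(learned_reward(candidates))]

# The chosen state scores highly under the learned reward but terribly
# under the ground truth: the optimiser has exploited the reward model's
# extrapolation rather than doing anything the demonstrator wanted.
print(f"optimiser picked s = {s_star:.2f}")
print(f"learned reward there: {learned_reward(s_star):.1f}")
print(f"true reward there:    {true_reward(s_star):.1f}")
```

A weak optimiser (say, one restricted to states near the demonstrations) would behave fine here; the problem only appears once the optimiser is strong relative to the reward model, which is exactly the asymmetry worried about above.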
Overall, I’m optimistic about the research community solving these problems, but think that present techniques are far from adequate. Although improved general-purpose learning techniques will be important, I believe there will also need to be a concerted focus on solving alignment-related problems.
Based on Toggl time tracking, I spent 45 hours on the process, including talking to the organisations and preparing the report.
All of the funding was unrestricted, although I had discussions with each organisation about their strategy and provided feedback. The organisations supported are all small, and I expect most of the upside to come from allowing them to demonstrate their worth and grow in the future, so I’d prefer not to constrain their plans.
I had anticipated my donations being somewhat more skewed towards a single organisation, but had always intended to make some grant to all organisations that I felt were promising after conducting an in-depth investigation. In particular, the organisations involved invested significant time (conversations with me, collating relevant data, feedback on this report), and I believe this should be rewarded, similar to the rationale behind GiveWell’s participation grants. I also think it’s likely that much of the money moved as a result of my research will be from third-parties influenced by this post, and I feel my recommendations are more credible if I put “my money where my mouth is”.
There aren’t many people with PhD-level research experience in relevant fields who are focusing on AI safety, so I think it’s a bit early to conclude these skills are “extremely rare” amongst qualified individuals.
AI safety research spans a broad range of areas, but for the more ML-oriented research the skills are, unsurprisingly, not that different from other fields of ML research. There are two main differences I’ve noticed:
In AI safety you often have to turn ill-defined, messy intuitions into formal problem statements before you can start working on them. In other areas of AI, people are more likely to have already formalized the problem for you.
It’s important to be your own harshest critic. This is cultivated in some other fields, such as computer security and (in a different way) in mathematics. But ML tends to encourage a sloppy attitude here.
I think both of these are fairly easy to measure by looking at someone’s past work and talking to them, though.
Identifying highly capable individuals is indeed hard, but I don’t think this is any more of a problem in AI safety research than in other fields. I’ve been involved in screening in two different industries (financial trading and, more recently, AI research). In both cases there’s always been a lot of guesswork involved, and I don’t get the impression it’s any better in other sectors. If anything I’ve found screening in AI easier: at least you can actually read the person’s work, rather than everything being hidden behind an NDA (common in many industries).
Upvoted because this is an important topic I’ve seen little discussion of. Although you take pains to draw attention to the limitations of this data set, these caveats aren’t included in the conclusion, so I’d be wary of anyone acting on this verbatim. I’d be interested in seeing drop-out rates in other social movements to give a better idea of the base rate.
Thanks for running this! It’s unfortunate this is at the same time as ICML/IJCAI/AAMAS, I’d have been interested in attending otherwise. Not sure what proportion of your target audience go to the major ML conferences, but might be worth trying to schedule around them for next year.