I am a mathematics grad student. I think that working on AI safety research would be a valuable thing for me to do, *if* the research were something I felt intellectually motivated by. Unfortunately, whether I feel intellectually motivated by a problem has little to do with what is useful or important; it basically just depends on how cool/aesthetic/elegant the math involved is.

I’ve taken a semester of ML and read a handful (~5) of AI safety papers as part of a Zoom reading group, and thus far none of it appeals. It might be that this is because nothing in AI research will be adequately appealing, but it might also be that I just haven’t found the right topic yet. So to that end: what’s the coolest math involved in AI safety research? What problems might I really like reading about or working on?

# [Question] What are the coolest topics in AI safety, to a hopelessly pure mathematician?

What are examples of things you find cool/aesthetic/elegant?

My favorite fields of math are abstract algebra, algebraic topology, graph theory, and computational complexity. The latter two are my current research fields. This may seem to contradict my claim of being a pure mathematician, but I think my natural approach to research is a pure mathematician’s approach, and I have on many occasions jokingly lamented the fact that TCS is in the CS department, instead of in the math department where it belongs. (This joke is meant as a statement about my own preferences, not a claim about how the world should be.)

Some examples of specific topics I’ve found particularly fun to explain to people: the halting problem, P vs NP and the idea of poly-time reductions, Kempe’s false proof of the four-color theorem, the basics of group theory.

I’m guessing you’ve already made up your mind on this since it’s been a few months, but since you mentioned computational complexity being your research field you might be interested to know that Scott Aaronson was persuaded by Jan Leike to spend a year at OpenAI to

… think about the theoretical foundations of AI safety and alignment. What, if anything, can computational complexity contribute to a principled understanding of how to get an AI to do what we want and not do what we don’t want?

(Scott admitted, like you, that he basically needed to be nerd-sniped into working on problems; “this is very important so you must work on it” doesn’t work in practice.)

Quoting Scott a bit more (and adding bullets):

So, what projects will I actually work on at OpenAI? Yeah, I’ve been spending the past week trying to figure that out. I still don’t know, but a few possibilities have emerged.

First, I might work out a general theory of sample complexity and so forth for learning in dangerous environments—i.e., learning where making the wrong query might kill you.

Second, I might work on explainability and interpretability for machine learning: given a deep network that produced a particular output, what do we even mean by an “explanation” for “why” it produced that output? What can we say about the computational complexity of finding that explanation?

Third, I might work on the ability of weaker agents to verify the behavior of stronger ones. Of course, if P≠NP, then the gap between the difficulty of solving a problem and the difficulty of recognizing a solution can sometimes be enormous. And indeed, even in empirical machine learning, there’s typically a gap between the difficulty of *generating* objects (say, cat pictures) and the difficulty of *discriminating* between them and other objects, the latter being easier. But this gap typically isn’t exponential, as is conjectured for NP-complete problems: it’s much smaller than that. And counterintuitively, we can then turn around and use the generators to improve the discriminators. How can we understand this abstractly? Are there model scenarios in complexity theory where we can prove that something similar happens? How far can we amplify the generator/discriminator gap, for example by using interactive protocols or debates between competing AIs?

That said, these mostly lean towards theory-builders, and you mentioned upthread being more problem-solver-oriented, so they probably aren’t as interesting.
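For readers who want the solve/verify gap from Scott’s third point in concrete form, here is a minimal sketch using Boolean satisfiability. Everything in it is an illustrative assumption rather than anything from Scott’s post: the clause encoding (signed integers for literals) and the names `verify` and `brute_force_solve` are made up for this example. Checking a proposed assignment takes one pass over the formula, while the naive search may try up to 2^n assignments; whether something fundamentally faster always exists is exactly the P vs NP question.

```python
from itertools import product


def verify(clauses, assignment):
    """Check a candidate assignment in one pass over the formula.

    clauses: list of clauses, each a list of nonzero ints (k means x_k, -k means not x_k).
    assignment: dict mapping variable index to bool.
    """
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in clauses
    )


def brute_force_solve(clauses, num_vars):
    """Naive search over all 2**num_vars assignments.

    Returns (assignment or None, number of assignments tried).
    """
    tries = 0
    for bits in product([False, True], repeat=num_vars):
        tries += 1
        assignment = {i + 1: b for i, b in enumerate(bits)}
        if verify(clauses, assignment):
            return assignment, tries
    return None, tries


# (x1 or x2) and (not x1 or x3) and (not x2 or not x3)
clauses = [[1, 2], [-1, 3], [-2, -3]]
solution, tries = brute_force_solve(clauses, 3)
```

The "weaker agent" only ever needs `verify`; the "stronger agent" is whoever can produce `solution` without the exhaustive loop.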

(Posting as a comment since I’m not really answering your actual question.)

I think *if* you find something within AI safety that is intellectually motivating for you, this will more likely than not be your highest-impact option. But FWIW here are some pieces that are mathy in one way or another that in my view still represent valuable work by impact criteria (in no particular order):

Report on Semi-informative Priors [relevant to AI timelines]

Racing to the precipice: a model of artificial intelligence development

Why we can’t take expected value estimates literally (even when they’re unbiased)

Philanthropy Timing and the Hinge of History (and other work on this by Phil Trammell)

The Simulation Hypothesis Undercuts the SIA/Great Filter Doomsday Argument

Statistical Normalization Methods in Interpersonal and Intertheoretic Comparisons

Absolutely agree with everything you’ve said here! AI safety is by no means the only math-y impactful work.

Most of these don’t quite feel like what I’m looking for, in that the math is being used to do something useful or valuable but the math itself isn’t very pretty. “Racing to the Precipice” looks closest to being the kind of thing I enjoy.

Thank you for the suggestions!

Love this question! I too would identify as a hopelessly pure mathematician (I’m currently working on a master’s thesis in category theory), and I too spent some time trying to relate my academic interests to AI safety. I didn’t have much success; in particular, nothing ML-related ever appealed. I hope it works out better for you!

You might be interested in this paper on ‘Backprop as Functor’.

(I’m personally not compelled by the safety case for such work, but YMMV, and I think I know at least a few people who are more optimistic.)

Question: would an impactful but not cool/popular/elegant topic interest you? What’s your balance between coolness and impactfulness?

I am not intellectually motivated by things on the basis of their impactfulness. If I were, I wouldn’t need to ask this question.

Elaborating on this, thanks to Spencer Becker-Kahn for prompting me to think about this more:

From a standpoint of my values and what I think is good, I’m an EA. But doing intellectual work, specifically, takes more than just my moral values. I can’t work on problems I don’t think are cool. I mean, I have, and I did, during undergrad, but it was a huge relief to be done with it after I finished my quals and I have zero desire to go back to it. It would be, at minimum, unsustainable for me to try to work on a problem where my main motivation for doing it is “it would be morally good for me to solve this.” I struggle a bit with motivation at the best of times, or rather, on the best of problems. So, if I can find something in AI safety that I think is approximately as cool as what I’m currently doing, I’ll do it, but the coolness is actually a requirement, because I won’t be successful or happy otherwise. I’m not built for it (and I think most EAs aren’t; fortunately some of them have different tastes than I do, as to what is or isn’t cool).

By Scott Garrabrant et al:

Logical induction

Cartesian frames

Finite factored sets

By John Wentworth

By myself:

Infra-Bayesianism (collaboration with Alexander Appel)

In particular, infra-Bayesian physicalism

RL with imperceptible rewards

Quantilization (building on work by Jessica Taylor)

Forecasting using incomplete models (related to logical induction and infra-Bayesianism)

Various topics that don’t have sufficiently good articles yet, such as 1 2 3

These are the sort of thing I’m looking for! In that, on first glance, they’re a lot of solid “maybe”s where mostly I’ve been finding “no”s. So that’s encouraging—thank you so much for the suggestions!

(Told this to Jenny in person, but posting for the benefit of others)

AI safety is a young, pre-paradigmatic area of research without a universally accepted mathematical formalism, so if you’re after cool math, my suggestion is to learn the basics of one or two well-established fields that are mathematically mature and have a decent chance of being relevant to AI safety.

In particular, I think Learning Theory and Causality are areas with plenty of Aesthetic Math™.

## Learning theory

Statistical learning theory is the mathematical study of inductive reasoning: how can we make generalizations from past observations to future observations? It’s an entire mathematically rich field devoted to formalizing Occam’s razor.

Computational learning theory imposes the further restriction that learning algorithms be computationally efficient. It has rich connections to other parts of theoretical computer science (for example, there is a duality between computational learning theory and cryptography: positive results for one translate to negative results for the other!). And there are many fun problems of a combinatorial puzzle flavor.

Most of learning theory assumes that observations are drawn i.i.d. from a distribution. Online Learning asks what happens if we eliminate this assumption. Incredibly, it can be shown that inductive reasoning can be successful even when observations are handcrafted by an adversary. The key is to measure success in relative rather than absolute terms: how did you perform in comparison to the best member of a pre-specified class of predictors? There are beautiful connections to convex analysis.

Readings:

An Introduction to Computational Learning Theory, by Kearns & Vazirani. A well-written, approachable introductory textbook with fun (to me) exercises. It’s somewhat outdated, but still the best learning theory text for a beginner imo.

Understanding Machine Learning: From Theory to Algorithms, by Shalev-Shwartz & Ben-David. A more up-to-date and comprehensive textbook on learning theory.

Online Learning & Online Convex Optimization by Shalev-Shwartz. An excellent unified treatment of the mathematics of online learning.
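To give a taste of the online learning claim above, here is a minimal sketch of the Hedge (multiplicative weights) algorithm, one standard method from that literature. The adversary `target_favorite`, the sizes `T` and `N`, and all function names are assumptions chosen for illustration. For losses in [0, 1] and learning rate sqrt(8 ln N / T), the standard guarantee is that regret against the best fixed expert is at most sqrt(T ln N / 2), no matter how the losses are chosen.

```python
import math


def hedge(adversary, n_experts, n_rounds):
    """Run Hedge against an adversary and return the regret.

    adversary: function mapping the learner's current probabilities
    to a list of per-expert losses in [0, 1].
    Regret = learner's expected loss minus the best single expert's loss.
    """
    eta = math.sqrt(8 * math.log(n_experts) / n_rounds)
    weights = [1.0] * n_experts
    expert_totals = [0.0] * n_experts
    learner_loss = 0.0
    for _ in range(n_rounds):
        z = sum(weights)
        probs = [w / z for w in weights]
        losses = adversary(probs)
        learner_loss += sum(p * l for p, l in zip(probs, losses))
        # Exponentially down-weight experts in proportion to their loss.
        weights = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
        expert_totals = [t + l for t, l in zip(expert_totals, losses)]
    return learner_loss - min(expert_totals)


def target_favorite(probs):
    """Hypothetical adversary: loss 1 for the currently most-trusted expert."""
    favorite = max(range(len(probs)), key=probs.__getitem__)
    return [1.0 if i == favorite else 0.0 for i in range(len(probs))]


T, N = 1000, 4
regret = hedge(target_favorite, N, T)
bound = math.sqrt(T * math.log(N) / 2)
```

Note that the learner still loses a lot in absolute terms here (the adversary always punishes its favorite); the guarantee is only that it does not do much worse than the best single expert, which is the relative notion of success described above.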

## Causality

I don’t know this area as well, but the material I have learned has been mathematically beautiful. In particular, I suggest learning about Judea Pearl’s theory of causality, which has been very influential in computer science, statistics, and some of the natural and social sciences. (There are a few competing formalisms for causality, but Pearl’s is the most mathematically beautiful as far as I can tell.) Pearl’s theory generalizes the classical theory of probability to allow for reasoning about cause and effect, using a framework that involves manipulations of directed acyclic graphs.

Reading: Causality, by Pearl.
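As a small illustration of how Pearl’s framework separates "seeing" from "doing", here is a sketch on a three-variable graph Z → X, Z → Y, X → Y with binary variables. The probability tables are invented for this example, and the function names are hypothetical. Intervening on X is modeled by deleting the Z → X dependence (replacing X’s factor with a point mass), and the result is compared with the backdoor adjustment formula, the sum over z of P(Y=1 | X=1, Z=z) P(Z=z).

```python
from itertools import product

# Made-up conditional probability tables for the graph Z -> X, Z -> Y, X -> Y.
P_Z = {0: 0.5, 1: 0.5}                      # P(Z=z)
P_X1_GIVEN_Z = {0: 0.2, 1: 0.8}             # P(X=1 | Z=z)
P_Y1_GIVEN_XZ = {(0, 0): 0.1, (1, 0): 0.3,  # P(Y=1 | X=x, Z=z), keyed by (x, z)
                 (0, 1): 0.6, (1, 1): 0.8}


def bern(p_one, value):
    """Probability that a Bernoulli(p_one) variable equals value."""
    return p_one if value == 1 else 1.0 - p_one


def joint(z, x, y, do_x=None):
    """Joint probability of (z, x, y); if do_x is set, X is forced to do_x."""
    px = bern(P_X1_GIVEN_Z[z], x) if do_x is None else float(x == do_x)
    return P_Z[z] * px * bern(P_Y1_GIVEN_XZ[(x, z)], y)


def prob_y1_given_x1():
    """Observational quantity P(Y=1 | X=1), by ordinary conditioning."""
    numerator = sum(joint(z, 1, 1) for z in (0, 1))
    denominator = sum(joint(z, 1, y) for z, y in product((0, 1), repeat=2))
    return numerator / denominator


def prob_y1_do_x1():
    """Interventional quantity P(Y=1 | do(X=1)), from the mutilated graph."""
    return sum(joint(z, x, 1, do_x=1) for z, x in product((0, 1), repeat=2))


def backdoor_adjustment():
    """Same quantity via adjusting for the confounder Z."""
    return sum(P_Y1_GIVEN_XZ[(1, z)] * P_Z[z] for z in (0, 1))
```

With these made-up numbers, conditioning gives about 0.70 while intervening gives about 0.55; the difference is the part of the correlation that flows through the confounder Z rather than through X itself.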

A handful of ideas (things that tickle my aesthetic) from an ex-topologist:

https://www.lesswrong.com/posts/Tr7tAyt5zZpdTwTQK/the-solomonoff-prior-is-malign

https://www.lesswrong.com/posts/EbFABnst8LsidYs5Y/goodhart-taxonomy/

https://ai-alignment.com/corrigibility-3039e668638 (and other things from https://ai-alignment.com/ )

The second and third strike me as useful ideas and kind of conceptually cool, but not terribly math-y; rather than feeling like these are interesting math problems, the math feels almost like an afterthought. (I’ve read a little about corrigibility before, and had the same feeling then.) The first is the coolest, but also seems like the least practical—doing math about weird simulation thought experiments is fun but I don’t personally expect it to come to much use.

Thank you for sharing all of these! I sincerely appreciate the help collecting data about how existing AI work does or doesn’t mesh with my particular sensibilities.

To me they feel like pre-formal math? Like the discussion of corrigibility gives me a tingly sense of “there’s what on the surface looks like an interesting concept here, and now the math-y question is whether one can formulate definitions which capture that and give something worth exploring”.

(I definitely identify more with the “theory builder” of Gowers’s two cultures.)

Ah, that’s a good way of putting it! I’m much more of a “problem solver.”

Cool!

My opinionated takes for problem solvers:

(1) Over time we’ll predictably move in the direction from “need theory builders” to “need problem solvers”, so even if you look around now and can’t find anything, it might be worth checking back every now and again.

(2) I’d look at ELK now for sure, as one of the best and further-in-this-direction things.

(3) Actually many things have at least some interesting problems to solve as you get deep enough. I expect curricula teaching ML to very much not do this, but if you have mastery of ML and are trying to achieve new things with it, many more interesting problems to solve come up. Unfortunately I don’t know how to predict how much of the itch this will address for you … maybe one question is how much satisfaction you find in solving problems outside of pure mathematics? (e.g. logic puzzles, but also things in other domains of life)

The point about checking back in every now and then is a good one; I had been thinking in more binary terms and it’s helpful to be reminded that “not yet, maybe later” is also a possible answer to whether to do AI safety research.

I like logic puzzles, and I like programming insofar as it’s like logic puzzles. I’m not particularly interested in e.g. economics or physics or philosophy. My preferred type of problem is very clear-cut and abstract, in the sense of being solvable without reference to how the real world works. More “is there an algorithm with time complexity Y that solves math problem X” than “is there a way to formalize real-world problem X into a math problem for which one might design an algorithm.” Unfortunately AI safety seems to be a lot of the latter!

(Terry Tao’s distinction between ‘pre-rigorous’, ‘rigorous’, and ‘post-rigorous’ maths might also be relevant.)

Maybe the notes on ‘ascription universality’ on ai-alignment.com are a better match for your sensibilities.

Not sure what mathematically interests you, but you should probably check out Vanessa Kosoy’s learning-theoretic research agenda (she is hiring mathematicians!). Also, the Topos Institute are doing many interesting things in AI safety and other things (I’m personally particularly interested in their compositionality/modeling work, which seems very cool to me).

A couple of unasked-for pieces of advice that may be relevant (would be for my past self who was sort of in a similar position):

(1) Sadly, we should often expect tradeoffs between impact and interest, since actually implementing innovations requires hard manual work. This is especially true in academic fields, where the impactful but uninteresting work is more neglected.

(2) Our interests change quite a bit over time, and it’s usually hard to predict how they might change. That said, many people find things more interesting the more competent they feel at them and the more they care about the problem they are trying to solve or the product they intend to deliver.

Your points (1) and (2) are ones I know all too well, though it was quite reasonable to point them out in case I didn’t, and they may yet prove helpful to other readers of this post.

Regarding Vanessa Kosoy’s work, I think I need to know more math to follow it (specifically learning theory, says Ben; for the benefit of those unlucky readers who are not married to him, he wrote his answer in more detail below). I did find myself enjoying reading what parts of the post I could follow, at least.

Regarding the Topos Institute, someone I trust has a low opinion of them; epistemic status secondhand and I don’t know the details (though I intend to ask about it).

Thanks very much for the suggestions!

I did a pure maths undergrad and recently switched to doing mechanistic interpretability work. My day job isn’t exactly doing maths, but I find it has a strong aesthetic appeal in a similar way. My job is not to train an ML model (with all the mess and frustration that involves), it’s to take a model someone else has trained, and try to rigorously understand what is going on with it. I want to take some behaviour I know it’s capable of and understand how it does that, and ideally try to decompile the operations it’s running into something human understandable. And, fundamentally, a neural network is just a stack of matrix multiplications. So I’m trying to build tools and lenses for analysing this stack of matrices, and converting it into something understandable. Day-to-day, this looks like having ideas for experiments, writing code and running them, getting feedback and iterating, but I’ve found a handful of times where having good intuitions around linear algebra, or how gradients work, and spending some time working through algebra has been really useful and clarifying.
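A toy illustration of the "stack of matrices" view, for readers who like to see it in code. This is only a sketch under strong assumptions and not the method of any particular paper: it ignores attention patterns, nonlinearities, layer norm and biases, uses random weights, and the names `W_E`, `W_V`, `W_O`, `W_U` and the sizes are labels chosen here. The point is just that a chain of linear maps collapses into one matrix that can be studied directly.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def rand_matrix(rows, cols, rng):
    """Random matrix with entries in [-1, 1] (stand-in for trained weights)."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
vocab, d_model, d_head = 5, 4, 2  # toy sizes, chosen arbitrarily

W_E = rand_matrix(d_model, vocab, rng)   # embed: one-hot token -> residual stream
W_V = rand_matrix(d_head, d_model, rng)  # value: residual stream -> head
W_O = rand_matrix(d_model, d_head, rng)  # output: head -> residual stream
W_U = rand_matrix(vocab, d_model, rng)   # unembed: residual stream -> logits

# Collapse the whole path into a single vocab x vocab matrix.
collapsed = matmul(W_U, matmul(W_O, matmul(W_V, W_E)))


def path_logits(token):
    """Push a one-hot token through the four maps one step at a time."""
    vec = [[1.0 if i == token else 0.0] for i in range(vocab)]
    for w in (W_E, W_V, W_O, W_U):
        vec = matmul(w, vec)
    return [row[0] for row in vec]
```

Column `t` of `collapsed` says how input token `t` pushes on every output logit along this path, so questions about the network’s behaviour become questions about one matrix (its rank, its largest entries, its singular vectors).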

If you’re interested in learning more, Zoom In is a good overview of a particular agenda for mechanistic interpretability in vision models (which I personally find super inspiring!), and my team wrote a pretty mathsy paper giving a framework to break down and understand small, attention-only transformers (I expect the paper to only make sense after reading an overview of autoregressive transformers like this one). If you’re interested in working on this, there are currently teams at Anthropic, Redwood Research, DeepMind and Conjecture doing work along these lines!

Thanks very much for the suggestions, I appreciate it a lot! Zoom In was a fun read—not very math-y but pretty cool anyway. The Transformers paper also seems kind of fun. I’m not really sure whether it’s math-y enough for me to be interested in it qua math...but in any event it was fun to read about, which is a good sign. I guess “degree of mathiness” is only one neuron of several neurons sending signals to the “coolness” layer, if I may misuse metaphors.

Here are some of mine to add to Vanessa’s list.

One on imitation learning. [Currently an “accept with minor revisions” at JMLR]

One on conservatism in RL. A special case of Vanessa’s infra-Bayesianism. [COLT 2020]

One on containment and myopia. [IEEE]

Some mathy AI safety pieces or other related material off the top of my head (in no particular order, and definitely not comprehensive nor weighted toward impact or influence):

The Speed + Simplicity Prior is probably anti-deceptive

Prediction can be Outer Aligned at Optimum

Reinforcement Learning in Newcomblike Environments

Commitment games with conditional information revelation

Chris Olah’s older pieces on neural networks (under ‘Neural Networks (General)’ and below)

In outer alignment one can write down a correspondence between ML training schemes that learn from human feedback and complexity classes related to interactive proof schemes. If we model the human as a (choosable) polynomial time algorithm, then

1. Debate and amplification get to PSPACE, and more generally n-step debate gets to $\Sigma_n^P$.

2. Cross-examination gets to NEXP.

3. If one allows opaque pointers, there are schemes that go further: market making gets to R.
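For readers less familiar with the polynomial hierarchy, it may help to recall one standard characterization of its $n$-th level (this is textbook material rather than something from the linked work, and the symbols $V$, $p$, $y_i$, $Q_n$ are notation introduced here). A language $L$ is in $\Sigma_n^P$ if and only if there is a polynomial-time predicate $V$ and a polynomial $p$ such that

```latex
\[
  x \in L \iff
  \exists y_1 \, \forall y_2 \, \exists y_3 \cdots Q_n y_n \;
  V(x, y_1, \ldots, y_n) = 1,
  \qquad |y_i| \le p(|x|) \text{ for all } i,
\]
```

where $Q_n$ is $\exists$ for odd $n$ and $\forall$ for even $n$. Read as a game, the alternating quantifiers are two players taking turns to make $n$ moves, and $V$ is a polynomial-time judge who rules on the transcript; this is roughly the intuition behind matching n-step debate with this level of the hierarchy.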

Moreover, we informally have constraints on which schemes are practical based on properties of their complexity class analogues. In particular, interactive proof schemes are only interesting if they relativize: we also have IP=PSPACE and thus a single prover gets to PSPACE given an arbitrary polynomial time verifier, but w.r.t. a typical oracle $\mathsf{IP}^O < \mathsf{PSPACE}^O$. My sense is there are further obstacles that can be found: my intuition is that “market making = R” isn’t the right theorem once obstacles are taken into account, but I don’t have a formalized model of this intuition.

The reason this type of intuition is useful is that humans are unreliable, and schemes that reach high complexity class analogues should (everything else equal) give more help to the humans in noticing problems with ML systems.

I think there’s quite a bit of useful work that can be done pushing this type of reasoning further, but (full disclosure) it isn’t of the “solve a fully formalized problem” sort. Two examples:

1. As mentioned above, I find “market making = R” unlikely to be the right result. But this doesn’t mean that market making isn’t an interesting scheme: there are connections between market making and Paul Christiano’s learning the prior scheme. As previously formalized, market making misses a practical limitation on the available human data (the n-way assumption in that link), so there may be work to do to reformalize it into a more limited complexity class in a more useful way.

2. Two-player debate is only one of many possible schemes using self-play to train systems, and in particular one could try to shift to n-player schemes in order to reduce playing-for-variance strategies where a behind player goes for risky lies in order to possibly win. But the “polynomial time judge” model can’t model this situation, as there is no variance when trying to convince a deterministic algorithm. As a result, there is a need for more precise formalization that can pick up the difference between self-play schemes that are more or less robust to human error, possibly related to CRMDPs.

You might be interested in this great intro sequence to embedded agency. There’s also corrigibility and MIRI’s other work on agent foundations.

Also, coherence arguments and consequentialist cognition.

AI safety is a young field; for most open problems we don’t yet know of a way to crisply state them in a way that can be resolved mathematically. So if you enjoy taking messy questions and turning them into neat math you’ll probably find much to work on.

ETA: oh and of course ELK.