Crossposted from the Alignment Forum and Less Wrong

Introduction

Imagine you are tasked with curing a disease which hasn’t appeared yet. Setting aside why you would know about such a disease’s emergence in the future, how would you go about curing it? You can’t directly or indirectly gather data about the disease itself, as it doesn’t exist yet; you can’t try new drugs to see if they work, as the disease doesn’t exist yet; you can’t do anything that is experimental in any way on the disease itself, as… you get the gist. You would be forgiven for thinking that there is nothing you can do about it right now.

AI Alignment looks even more hopeless: it’s about solving a never-before seen problem on a technology which doesn’t exist yet.

Yet researchers are actually working on it! There are papers, books, unconferences, whole online forums full of alignment research, both conceptual and applied. Some of these work in a purely abstract and theoretical realm, others study our best current analogous of Human Level AI/Transformative AI/AGI, still others iterate on current technologies that seem precursors to these kinds of AI, typically large language models. With so many different approaches, how can we know if we’re making progress or just grasping at straws?

Intuitively, the latter two approaches sound more like how we should produce knowledge: they take their epistemic strategies (ways of producing knowledge) out of Science and Engineering, the two cornerstones of knowledge and technology in the modern world. Yet recall that in alignment, models of the actual problem and/or technology can’t be evaluated experimentally, and one cannot try and iterate on proposed solutions directly. So when we take inspiration from Science and Engineering (and I want people to do that), we must be careful and realize that most of the guarantees and checks we associate with both are simply not present in alignment, for the decidedly pedestrian reason that Human Level AI/Transformative AI/AGI doesn’t exist yet.

I thus claim that:

Epistemic strategies from Science and Engineering don’t dominate other strategies in alignment research.
Given the hardness of grounding knowledge in alignment, we should leverage every epistemic strategy we can find and get away with.
These epistemic strategies should be made explicit and examined. Both the ones taken or adapted from Science and Engineering, and every other one (for example the more theoretical and philosophical strategies favored in conceptual alignment research).

This matters a lot, as it underlies many issues and confusions in how alignment is discussed, taught, created and criticized. Having such a diverse array of epistemic strategies is fascinating, but their implicit nature makes it challenging to communicate with newcomers, outsiders, and even fellow researchers leveraging different strategies. Here is a non-exhaustive list of issues that boil down to epistemic strategy confusion:

There is a strong natural bias towards believing that taking epistemic strategies from Science and Engineering automatically leads to work that is valuable for alignment.
- This leads to a flurry of work (often by well-meaning ML researchers/engineers) that doesn’t tackle or help with alignment, and might push capabilities (in an imagined tradeoff with the assumed alignment benefits of the work)
- Very important: that doesn’t mean I consider work based on ML as useless for alignment. Just that the sort of ML-based work that actually tries to solve the problem tends to be by people who understand that one must be careful when transfering epistemic strategies to alignment.
There is a strong natural bias towards disparaging any epistemic strategy that doesn’t align neatly with the main strategies of Science and Engineering (even when the alternative epistemic strategies are actually used by scientists and engineers in practice!)
- This leads to more and more common confusion about what’s happening on the Alignment Forum and conceptual alignment research more generally. And that confusion can easily turn into the sort of criticism that boils down to “this is stupid and people should stop doing that”.
It’s quite hard to evaluate whether alignment research (applied or conceptual) creates the kind of knowledge we’re after, and helps move the ball forward. This comes from both the varieties of epistemic strategies and the lack of the usual guarantees and checks when applying more mainstream ones (every observation and experiment is used through analogies or induction to future regimes).
- This makes it harder for researchers using different sets of epistemic strategies to talk to each other and give useful feedback to one another.
Criticism of some approach or idea often stops at the level of the epistemic strategy being weird.
- This happens a lot with criticism of lack of concreteness and formalism and grounding for Alignment Forum posts.
- It also happens when applied alignment research is rebuked solely because it uses current technology, and the critics have decided that it can’t apply to AGI-like regimes.
Teaching alignment without tackling this pluralism of epistemic strategies, or by trying to fit everything into a paradigm using only a handful of those, results in my experience in people who know the lingo and some of the concepts, but have trouble contributing, criticizing and teaching the ideas and research they learnt.
- You can also end up with a sort of dogmatism that alignment can only be done a certain way.

Note that most fields (including many sciences and engineering disciplines) also use weirder epistemic strategies. So do many attempts at predicting and directing the future (think existential risk mitigation in general). My point is not that alignment is a special snowflake, more that it’s both weird enough (in the inability to experiment and iterate directly with) and important enough that elucidating the epistemic strategies we’re using, finding others and integrating them is particularly important.

In the rest of this post, I develop and unfold my arguments in more detail. I start with digging deeper into what I mean by the main epistemic strategies of Science and Engineering, and why they don’t transfer unharmed to alignment. Then I demonstrate the importance of looking at different epistemic strategies, by focusing on examples of alignment results and arguments which make most sense as interpreted through the epistemic lens of Theoretical Computer Science. I use the latter as an inspiration because I adore that field and because it’s a fertile ground for epistemic strategies. I conclude by pointing at the sort of epistemic analyses I feel are needed right now.

Lastly, this post can be seen as a research agenda of sorts, as I’m already doing some of these epistemic analyses, and believe this is the most important use of my time and my nerdiness about weird epistemological tricks. We roll with what evolution’s dice gave us.

Thanks to Logan Smith, Richard Ngo, Remmelt Ellen,, Alex Turner, Joe Collman, Ruby, Antonin Broi, Antoine de Scorraille, Florent Berthet, Maxime Riché, Steve Byrnes, John Wentworth and Connor Leahy for discussions and feedback on drafts.

Science and Engineering Walking on Eggshells

What is the standard way of learning how the world works? For centuries now, the answer has been Science.

I like Peter Godfrey-Smith’s description in his glorious Theory and Reality:

Science works by taking theoretical ideas and trying to find ways to expose them to observation. The scientific strategy is to construe ideas, to embed them in surrounding conceptual frameworks, and to develop them, in such a way that this exposure is possible even in the case of the most general and ambitious hypotheses about the universe.

That is, the essence of science is in the trinity of modelling, predicting and testing.

This doesn’t mean there are no epistemological subtleties left in modern science; finding ways of gathering evidence, of forcing the meeting of model and reality, often takes incredibly creative turns. Fundamental Chemistry uses synthesis of molecules never seen in nature to test the edge cases of its models; black holes are not observable directly, but must be inferred by a host of indirect signals like the light released by matter falling in the black hole or gravitational waves during black holes merging; Ancient Rome is probed and explored through a discussion between textual analysis and archeological discoveries.

Yet all of these still amount to instantiating the meta epistemic strategy “say something about the world, then check if the world agrees”. As already pointed out, it doesn’t transfer straightforwardly to alignment because Human Level AI/Transformative AI/AGI doesn’t exist yet.

What I’m not saying is that epistemic strategies from Science are irrelevant to alignment. But because they must be adapted to tell us something about a phenomenon that doesn’t exist yet, they lose their supremacy in the creation of knowledge. They can help us to gather data about what exists now, and to think about the sort of models that are good at describing reality, but checking their relevance to the actual problem/thinking through the analogies requires thinking in more detail about what kind of knowledge we’re creating.

If we instead want more of a problem solving perspective, tinkering is a staple strategy in engineering, before we know how to solve the problem things reliably. Think about curing cancer or building the internet: you try the best solutions you can think of, see how they fail, correct the issues or find a new approach, and iterate.

Once again, this is made qualitatively different in alignment by the fact that neither the problem nor the source of the problem exist yet. We can try to solve toy versions of the problem, or what we consider analogous situations, but none of our solutions can be actually tested yet. And there is the additional difficulty that Human-level AI/Transformative AI/AGI might be so dangerous that we have only one chance to implement the solution.

So if we want to apply the essence of human technological progress, from agriculture to planes and computer programs, just trying things out, we need to deal with the epistemic subtleties and limits of analogies and toy problems.

An earlier draft presented the conclusion of this section as “Science and Engineering can’t do anything for us—what should we do?” which is not my point. What I’m claiming is that in alignment, the epistemic strategies from both Science and Engineering are not as straightforward to use and leverage as they usually are (yes, I know, there’s a lot of subtleties to Science and Engineering anyway). They don’t provide privileged approaches demanding minimal epistemic thinking in most cases; instead we have to be careful how we use them as epistemic strategies. Think of them as tools which are so good that most of the time, people can use them without thinking about all the details of how the tools work, and get mostly the wanted result. My claim here is that these tools need to be applied with significantly more care in alignment, where they lose their “basically works all the time” status.

Acknowledging that point is crucial for understanding why alignment research is so pluralistic in terms of epistemic strategies. Because no such strategy works as broadly as we’re used to for most of Science and Engineering, alignment has to draw from literally every epistemic strategy it can pull, taking inspiration from Science and Engineering, but also philosophy, pre-mortems of complex projects, and a number of other fields.

To show that further, I turn to some main alignment concepts which are often considered confusing and weird, in part because they don’t leverage the most common epistemic strategies. I explain these results by recontextualizing them through the lens of Theoretical Computer Science (TCS).

Epistemic Strategies from TCS

For those who don’t know the field, Theoretical Computer Science focuses on studying computation. It emerged from the work of Turing, Church, Gödel and others in the 30s, on formal models of what we would now call computation: the process of following a step-by-step recipe to solve a problem. TCS cares about what is computable and what isn’t, as well as how much resources are needed for each problem. Probably the most active subfield is Complexity Theory, which cares about how to separate computational problems in classes capturing how many resources (most often time – number of steps) are required for solving them.

What makes TCS most relevant to us is that theoretical computer scientists excel at wringing knowledge out of the most improbable places. They are brilliant at inventing epistemic strategies, and remember, we need every one we can find for solving alignment.

To show you what I mean, let’s look at three main ideas/arguments in alignment (Convergent Subgoals, Goodhart’s Law and the Orthogonality Thesis) through some TCS epistemological strategies.

Convergent Subgoals and Smoothed Analysis

One of the main argument for AI Risk and statement of a problem in alignment is Nick Bostrom’s Instrumental Convergence Thesis (which also takes inspiration from Steve Omohundro’s Basic AI Drives):

Several instrumental values can be identified which are convergent in the sense that their attainment would increase the chances of the agent’s goal being realized for a wide range of final goals and a wide range of situations, implying that these instrumental values are likely to be pursued by many intelligent agents.

That is, actions/plans exist which help with a vast array of different tasks: self-preservation, protecting one’s own goal, acquiring resources… So a Human-level AI/Transformative AI/AGI could take them while still doing what we asked it to do. Convergent subgoals are about showing that behaviors which look like they can only emerge from rebellious robots actually can be pretty useful for obedient (but unaligned in some way) AI.

What kind of argument is that? Bostrom makes a claim about “most goals”—that is, the space of all goals. His claim is that convergent subgoals are so useful that goal-space is almost chock-full of goals incentivizing convergent subgoals.

And more recent explorations of this argument have followed this intuition: Alex Turner et al.’s work on power-seeking formalizes the instrumental convergence thesis in the setting of Markov decision processes (MDP) and reward functions by looking, for every “goal” (a distribution over reward functions) at the set of all its permuted variants (the distribution given by exchanging some states – so the reward labels stay the same, but are not put on the same states). Their main theorems state that given some symmetry properties in the environment, a majority (or a possibly bigger fraction) of the permuted variant of every goal will incentivize convergent subgoals for its optimal policies.

So this tells us that goals without convergent subgoals exist, but they tend to be crowded out by ones with such subgoals. Still, it’s very important to realize what neither Bostrom nor Turner are arguing for: they’re not saying that every goal has convergent subgoals. Nor are they claiming to have found the way humans sample goal-space, such that their results imply goals with convergent subgoals must be sampled by humans with high probability. Instead, they show the overwhelmingness of convergent subgoals in some settings, and consider that a strong indicator that avoiding them is hard.

I see a direct analogy with the wonderful idea of smoothed analysis in complexity theory. For a bit of background, complexity theory generally focuses on the worst case time taken by algorithms. That means it mostly cares about which input will take the most time, not about the average time taken over all inputs. The latter is also studied, but it’s nigh impossible to find the distribution of input actually used in practice (and some problems studied in complexity theory are never used in practice, so a meaningful distribution is even harder to make sense of). Just like it’s very hard to find the distribution from which goals are sampled in practice.

As a consequence of the focus on worst case complexity, some results in complexity theory clash with experience. Here we focus on the simplex algorithm, used for linear programming: it runs really fast and well in practice, despite having provable exponential worst case complexity. Which in complexity-theory-speak means it shouldn’t be a practical algorithm.

Daniel Spielman and Shang-Hua Teng had a brilliant intuition to resolve this inconsistency: what if the worst case inputs were so rare that just a little perturbation would make them easy again? Imagine a landscape that is mostly flat, with some very high but very steep peaks. Then if you don’t land exactly on the peaks (and they’re so pointy that it’s really hard to get there exactly), you end up on the flat surface.

This intuition yielded smoothed analysis: instead of just computing the worst case complexity, we compute the worst case complexity averaged over some noise on the input. Hence the peaks get averaged with the flatness around them and have a low smoothed time complexity.

Convergent subgoals, especially in Turner’s formulation, behave in an analogous way: think of the peaks as goals without convergent subgoals; to avoid power-seeking we would ideally reach one of them, but their rarity and intolerance to small perturbations (permutations here) makes it really hard. So the knowledge created is about the shape of the landscape, and leveraging the intuition of smoothed analysis, that tells us something important about the hardness of avoiding convergent subgoals.

Note though that there is one aspect in which Turner’s results and the smoothed analysis of the simplex algorithm are complete opposite: in the former the peaks are what we want (no convergent subgoals) while in the latter they’re what we don’t want (inputs that take exponential time to treat). This inversion doesn’t change the sort of knowledge produced, but it’s an easy source of confusion.

Now, epistemic analysis isn’t only meant for clarifying and distilling results: it can and should pave the way to some insights on how the arguments could fail. Here, the analogy to smoothed complexity and the landscape picture suggests that Bostrom and Turner’s argument could be interrogated by:

Arguing that the sampling method we use in practice to decide tasks and goals for AI targets specifically the peaks.
Arguing that the sampling is done in a smaller goal-space for which the peaks are broader
- For Turner’s version, one way of doing that might be to not consider the full orbit, but only the close-ish variations of the goals (small permutations instead of all permutations). So questioning the form of noise over the choice of goals that is used in Turner’s work

Smoothed Analysis	Convergent Subgoals (Turner et al)
Possible Inputs	Possible Goals
Worst-case inputs make steep and rare peaks	Goals without convergent subgoals make steep and rare peaks

Goodhart’s Law and Distributed Impossibility/Hardness Results

Another recurrent concept in alignment thinking is Goodhart’s law. It wasn’t invented by alignment researchers, but Scott Garrabrant and David Manheim proposed a taxonomy of its different forms. Fundamentally, Goodhart’s law tells us that if we optimize a proxy of what we really want (some measure that closely correlates with the wanted quantity in the regime we can observe), strong optimization tends to make the two split apart, meaning we don’t end up with what we really wanted. For example, imagine that everytime you go for a run you put on your running shoes, and you only put on these shoes for running. Putting on your shoes is thus a good proxy for running; but if you decide to optimize the former in order to optimize the latter, you will take most of your time putting on and taking off your shoes instead of running.

In alignment, Goodhart is used to argue for the hardness of specifying exactly what we want: small discrepancies can lead to massive differences in outcomes.

Yet there is a recurrent problem: Goodhart’s law assumes the existence of some True Objective, of which we’re taking a proxy. Even setting aside the difficulties of defining what we really want at a given time, what if what we really want is not fixed but constantly evolving? Thinking about what I want nowadays, for example, it’s different from what I wanted 10 years ago, despite some similarities. How can Goodhart’s law apply to my values and my desires if there is not a fixed target to reach?

Salvation comes from a basic insight when comparing problems: if problem A (running a marathon) is harder than problem B (running 5 km), then showing that the latter is really hard, or even impossible, transfers to the former.

My description above focuses on the notion of one problem being harder than another. TCS formalizes this notion by saying the easier problem is reducible to the harder one: a solution for the harder one lets us build a solution for the easier problem. And that’s the trick: if we show there is no solution for the easier problem, this means that there is no solution for the harder one, or such a solution could be used to solve the easier problem. Same thing with hardness results which are about how difficult it is to solve a problem.

That is, when proving impossibility/hardness, you want to focus on the easiest version of the problem for which the impossibility/hardness still holds.

In the case of Goodhart’s law, this can be used to argue that it applies to moving targets because having True Values or a True Objective makes the problem easier. Hitting a fixed target sounds simpler than hitting a moving or shifting one. If we accept that conclusion, then because Goodhart’s law shows hardness in the former case, it also does in the latter.

That being said, whether the moving target problem is indeed harder is debatable and debated. My point here is not to claim that this is definitely true, and so that Goodhart’s law necessarily applies. Instead, it’s to focus the discussion on the relative hardness of the two problems, which is what underlies the correctness of the epistemic strategy I just described. So the point of this analysis is that there is another argument to decide the usefulness of Goodhart’s law in alignment than debating the existence of True Values.

	Running	Alignment
Easier Problem	5k	Approximating a fixed target (True Values)
Harder Problem	Marathon	Approximating a moving target

Orthogonality Thesis and Complexity Barriers

My last example is Bostrom’s Orthogonality Thesis: it states that goals and competence are orthogonal, meaning that they are independent– a certain level of competence doesn’t force a system to have only a small range of goals (with some subtleties that I address below).

That might sound only too general to really be useful for alignment, but we need to put it in context. The Orthogonality Thesis is a response to a common argument for alignment-by-default: because a Human-level AI/Transformative AI/AGI would be competent, it should realize what we really meant/wanted, and correct itself as a result. Bostrom points out that there is a difference between understanding and caring. The AI understanding our real intentions doesn’t mean it must act on that knowledge, especially if it is programmed and instructed to follow our initial commands. So our advanced AI might understand that we don’t want it to follow convergent subgoals while maximizing the number of paperclips produced in the world, but what it cares about is the initial goal/command/task of maximizing paperclips, not the more accurate representation of what we really meant.

Put another way, if one wants to prove alignment-by-default, the Orthogonality thesis argues that competence is not enough. As it is used, it’s not so much a result about the real world, but a result about how we can reason about the world. It shows that one class of arguments (competence will lead to human values) isn’t enough.

Just like some of the weirdest results in complexity theory: the barriers to P vs NP. This problem is one of the biggest and most difficult open questions in complexity theory: settling formally the question of whether the class of problems which can tractably be solved (P for Polynomial time) is equal to the class of problems for which solutions can be tractably checked (NP for Non-deterministic Polynomial time). Intuitively those are different: the former is about creativity, the second about taste, and we feel that creating something of quality is harder than being able to recognize it. Yet a proof of this result (or its surprising opposite) has evaded complexity theorists for decades.

That being said, recall that theoretical computer scientists are experts at wringing knowledge out of everything, including their inability to prove something. This resulted in the three barriers to P vs NP: three techniques from complexity theory which have been proved to not be enough by themselves for showing P vs NP or its opposite. I won’t go into the technical details here, because the analogy is mostly with the goal of these barriers. They let complexity theorists know quickly if a proof has potential – it must circumvent the barriers somehow.

The Orthogonality thesis plays a similar role in alignment: it’s an easy check for the sort of arguments about alignment-by-default that many people think of when learning about the topic. If they extract alignment purely from the competence of the AI, then the Orthogonality Thesis tells us something is wrong.

What does this mean for criticism of the argument? That what matters when trying to break the Orthogonality Thesis isn’t its literal statement, but whether it still provides a barrier to alignment-by-default. Bostrom himself points out that the Orthogonality Thesis isn’t literally true in some regimes (for example some goals might require a minimum of competence) but that doesn’t affect the barrier nature of the result.

Barriers to P vs NP	Orthogonality Thesis
Proof techniques that are provably not enough to settle the question	Competence by itself isn’t enough to show alignment-by-default

Improving the Epistemic State-of-the-art

Alignment research aims at solving a never-before-seen problem caused by a technology that doesn’t exist yet. This means that the main epistemic strategies from Science and Engineering need to be adapted if used, and lose some of their guarantees and checks. In consequence, I claim we shouldn’t only focus on those, but use all epistemic strategies we can find to produce new knowledge. This is already happening to some extent, causing both progress on many fronts but also difficulties in teaching, communicating and criticizing alignment work.

In this post I focused on epistemological analyses drawing from Theoretical Computer Science, both because of my love for it and because it fits into the background of many conceptual alignment researchers. But many different research directions leverage different epistemic strategies, and those should also be studied and unearthed to facilitate learning and criticism.

More generally, the inherent weirdness of alignment makes it nigh impossible to find one unique methodology of doing it. We need everything we can get, and that implies a more pluralistic epistemology. Which means the epistemology of different research approaches must be considered and studied and made explicit, if we don’t want to be confusing for the rest of the world, and each other too.

I’m planning on focusing on such epistemic analyses in the future, both for the main ideas and concepts we want to teach to new researchers and for the state-of-the art work that needs to be questioned and criticized productively.

On Solving Problems Before They Appear: The Weird Epistemologies of Alignment