Possible miracles

Epistemic status: Speculative and exploratory.

Contributions: Akash wrote the initial list; Thomas reviewed the list and provided additional points. Unless specified otherwise, writing in the first person is by Akash and so are the opinions. Thanks to Joshua Clymer, Tom Shlomi, and Eli Lifland for comments. Thanks to many others for relevant conversations.

If we need a miracle, where might it come from? What would it look like?

Many of the arguments presented in List of Lethalities are compelling to me. Some of my colleagues and I spend many hours thinking about how hard the alignment problem is, analyzing various threat models, and getting familiar with all the ways we could fail.

I thought it would be a useful exercise to intentionally try to think in the opposite way. How might we win?

I have found “miracles” to be a helpful frame, but it’s misleading in some ways. For example, I think the “miracles” frame implies an extremely low chance of success (e.g., <1%) and fosters a “wait and hope” mentality (as opposed to a “proactively make things happen” mentality). I was considering titling this something else (e.g., “Reasons for Hope” or “Possible Victory Conditions”), but these frames also didn’t feel right. With the phrase “miracles,” I’m trying to convey that (a) I don’t feel particularly confident in these ideas, (b) I am aware of many counterarguments to these claims [and indeed some are already presented in List of Lethalities], and (c) I don’t think “hope” or “victory” sets the right tone—the right tone is “ah gosh, things seem really hard, but if we win, maybe we’ll win because of something like this.”

I have found it helpful to backchain from these miracles to come up with new project ideas. If you think it’s plausible that we need a miracle, I encourage you to form your own list of possible miracles & think carefully about what kinds of projects might make each one more likely to occur (or make it more likely that we notice one in time).

With this in mind, here’s my list of possible miracles.

New Agendas

1. New people are entering the field of alignment at a faster rate than ever before. Some large programs are being designed to attract more AI safety researchers. One of these new researchers could approach the problem from an entirely new perspective (one that has not yet been discovered by the small community of <300 researchers). Other scientific fields have had moments in which one particularly gifted individual finds a solution that others had missed.

a. There are some programs that are optimizing for finding talented people, exposing them to alignment arguments, and supporting the ones who become interested in alignment.

b. Only a small fraction of people in the US and UK have been exposed to AI x-risk arguments.

c. Even less effort has gone into outreach to people outside the US and UK.

2. More resources are being poured into the mentorship and training of new alignment researchers. Perhaps the miracle will not come from a new “genius” but from a reasonably smart person (or set of people) who receives high-quality mentorship.

a. The pool of mentors is also expanding. It seems plausible to me that the top 2-10% of junior alignment researchers will soon be ready to take on their own mentees. This not only increases the total pool of mentors but also increases the diversity of thought in the mentorship pool.

b. Some training programs are explicitly trying to get people to think creatively and come up with new agendas (e.g., Refine).

c. Some training programs are explicitly trying to get people from different [non-CS/math] disciplines (e.g., Philosophy Fellowship; PIBBSS).

3. The bar for coming up with new agendas lowers each year. At the very beginning of the field, people needed to identify the problem from virtually no foundation. The field needed people who could (a) identify the problem, (b) identify that the problem was extremely important, (c) discover subproblems, and (d) perform research on these problems with relatively few others to discuss ideas with and relatively few prior works to guide their thinking. The field is still pre-paradigmatic, and these skills still matter. But each year, the importance of “being able to understand things from the empty string” decreases and other abilities (e.g., “being able to take a concept that someone else discovered and make it clearer” or “being able to draw connections between various ideas” or “being able to prioritize between different ideas that already exist”) become more important. It seems likely that there are some people who are better at “generating solutions” (taking a messy field and looking at it more clearly than others; building on the ideas of others; coming up with new ideas and solutions in a field that has even a little bit of foundation) than “discovering problems” (starting from scratch and seeing things that no one else sees; inventing a field essentially from scratch; arguing for the legitimacy of the field).

4. The “weirdness” of AI alignment decreases each year. This means AI alignment will attract more people, but it also means that AI alignment will attract different types of people. As much as I love rationality, perhaps the miracle lies within someone with a style of thinking that is underrepresented among people who enjoy things like The Sequences/HPMOR/Yudkowsky quirks. (I must caveat, of course, that I think the ideas in these writings are important, and I personally enjoy the style in which they are written. But I know intelligent people who find the style of writing off-putting and who are more likely to join a “normie” AI alignment community than one that is associated with [stereotypes about] rationalists.)

Alignment might be easier than we expect

5. Much (though notably not all) of the research around AI alignment came from a time when people were thinking about agents that are maximizing utility functions. It came from a time when people were trying to understand core truths about intelligence and how agents work. In many ways, the deep learning revolution appears to make alignment more difficult. It seems like we may get AGI sooner than we expected, through a method that we didn’t expect, a method that we don’t understand, and a method that seems nearly impossible to understand. But perhaps there are miracles that can be discovered within this new paradigm.

a. Until recently, relatively little work in the AI x-risk community has been focused specifically on large language models. The bad news is that if GPT-X is going to get us to AGI, much of the existing work (using math/logic/agent foundations approaches) is going to be less relevant. The good news is that we might be playing a different game than we thought we were. Again, that doesn’t inherently mean this game is easier. But if someone gave me a game, and then I thought it was impossible, and then they said okay here play this new game instead, I would at least think “alright, this one might also be impossible, but maybe it’ll be easier than the last one.”

b. For example, in the current regime, LLMs seem to reason like simulators rather than consequentialist utility-maximizers. This may change as capabilities increase, but it is very possible that LLMs stay in the current simulator regime even after they are superhuman. Aligning simulators is not necessarily easier than aligning utility-maximizers, but it sure is an example of how the game could change.

c. Maybe we will be able to use language models to help us generate new ideas for alignment. It seems plausible that a model could be [powerful enough to help us meaningfully with alignment research] while also being [weak enough that we can align it]. Some people have arguments that we can’t get a model that is [powerful enough to help us] without it also being [too powerful for us to be able to align it]. I find this plausible, and honestly, likely, but it also seems plausible that (a) there is a “goldilocks zone” of capabilities [powerful enough to meaningfully help but weak enough to align] and (b) we can build models in this goldilocks zone.

6. A new paradigm might emerge besides deep learning, where distributional shift and inner alignment are not such big, fundamental issues.

7. Deceptive alignment might not be selected for. In the regime where you apply enough optimization power to get a superintelligence, perhaps the simplest training story is to become ‘corrigibly aligned’. See this comment where Evan estimates this probability conditional on different architectures.

8. Paul’s “Basin of Corrigibility” might be very wide. If this is the case, most values that are anywhere near human values end up in the attractor well where the AGI wants to figure out what we want.

9. Interpretability might become easier as networks get closer to human level because they start using human abstractions. It only looks hopeless now because we are in the regime between ‘extremely simple’ and ‘human level’ where the models have to use confused abstractions.

10. The Sharp Left Turn (the distribution shift associated with a rapid increase in capabilities) might not be that large of a leap. It could be that alignment properties tend to generalize across this distribution shift.

11. Given enough time, we might be able to develop a secure sandbox environment that allows us to observe alignment failures such as deception and treacherous turns, and then iterate on them.

12. Takeoff speeds might be slow enough that we can study the AI at a relatively advanced capability level, iterate on different designs, and potentially get superhuman alignment research out of it. One way this could happen is if capability improvement during takeoff is bottlenecked by compute.

13. Some of the tools or agendas we already have will be sufficient. Perhaps we were lucky, and some mixture of really good interpretability + really good adversarial training + penalizing models that do [bad stuff] in training will be good enough. Alternatively, there could be one or more breakthroughs in theoretical agendas (e.g., selection theorems, ELK, infra-bayesianism) and the breakthrough(s) can be implemented somehow. (Note that this one feels especially strange to label as a “miracle”, except to the people who are very pessimistic about current agendas).

Timelines might be longer than we expect

14. Coordination between AI labs might lead to less capabilities research and more safety research. In the best case, this would look like coordinated agreements across all labs.

a. Unfortunately, ‘building AGI’ is a key part of several organizations’ mission statements. E.g., OpenAI says ‘We will attempt to directly build safe and beneficial AGI, but will also consider our mission fulfilled if our work aids others to achieve this outcome.’

b. There may be accidents/misuses of AI before we get to AGI. This might cause people (and most notably, capabilities researchers + leadership at major AI organizations) to invest more into safety and less into capabilities.

c. To the extent that these accidents/misuses were predicted by members of the AI alignment community, this may cause leaders of AI organizations to take the AI alignment community more seriously.

15. Several AI alignment leaders and AI capabilities leaders seem to like and respect each other. Maybe more dialogue between these two groups will lead to less capabilities research and more safety research.

a. It seems like there have been few well-designed/well-structured conferences/meetings between these two groups (though it seems plausible that they may be communicating more regularly in private).

b. Current arguments around taking AI x-risk seriously are convincing to me, but I sometimes encounter intelligent and reasonable people who find them unconvincing. It is possible that, as the field matures, new challenges will be discovered or old challenges will be presented with more evidence/rigor, and that this will persuade more of these people.

As a brief aside, I don’t think there is a single “introduction to AI x-risk” resource that rigorously and compellingly presents, from start to finish, the core arguments around AI x-risk. The closest things I have seen are Superintelligence (which is long, written before the deep learning revolution, and sometimes perceived as overly philosophical) and List of Lethalities (which doesn’t explain a lot of ideas from first principles and would understandably be too weird/off-putting for some readers). I am aware of other resources that try to do this, but I have not yet seen anything at the level of rigor/clarity/conciseness that “meets the bar” or even seems “standard” in other fields. It seems plausible to me that leaders of AI labs have already been presented with all of the existing key arguments (e.g., through conversations with alignment researchers). But maybe (a) this isn’t true, (b) the conversations haven’t been particularly well-structured, or (c) high-quality written resources would be more persuasive than conversations. If so, there might be a miracle here.

16. Many leaders in top AI labs take AI x-risk arguments seriously, and everyone wants to make sure AI is deployed safely. But knowing what to do is hard. Maybe we will come up with more specific recommendations that AI labs can implement.

a. Some of these may result from evaluation tools (e.g., here’s a tool you can use to evaluate your AI system; don’t deploy the system unless it passes XYZ checks) and benchmarks (e.g., here’s a metric that you can use to evaluate your AI system; don’t deploy it unless it meets this score). These tools and benchmarks may not be robust enough to avoid being fooled by a sufficiently intelligent system, but they may buy us time and help labs/researchers understand what types of safety research to prioritize.

17. Moore’s law might slow down and compute might become a bottleneck.

18. Maybe deep learning won’t scale to AGI. Maybe a paradigm shift will occur that replaces deep learning (or a paradigm shift will occur within deep learning).

Miscellaneous

19. We might be able to solve whole brain emulation before AGI arrives, and then upload alignment researchers who can do research 1000x faster than existing researchers. This is also risky, because (a) the sped-up humans could disempower humanity by themselves and (b) these humans might accidentally build an AGI or do capabilities research.

Final thoughts

I’m undoubtedly missing some miracles.
I encourage others to try out this exercise. Guiding prompts: If we win, how did we win? Why did we win? What did that world look like?
Other exercises I like:
1. Writing your naive hypotheses on AI alignment.
2. Writing your opinions on the arguments in List of Lethalities (see also this template).
3. Writing a detailed narrative of what you expect the next several years to look like.