A Defense of Work on Mathematical AI Safety
AI Safety was, a decade ago, nearly synonymous with obscure mathematical investigations of hypothetical agentic systems. Fortunately or unfortunately, this has largely been overtaken by events; the successes of machine learning and the promise, or threat, of large language models have pushed thoughts of mathematics aside for many in the “AI Safety” community. The once pre-eminent advocate of this class of “agent foundations” research, Eliezer Yudkowsky, has more recently said that timelines are too short for this agenda to have a significant impact. That conclusion seems at best premature.
Foundational research is useful for prosaic alignment
First, foundational and mathematical research can act synergistically, both with technical progress on safety and with insight into how and where safety is critical. Many machine learning research agendas for safety are investigating issues identified years earlier by foundational research, and are at least partly informed by that research. Current mathematical research could play a similar role in the coming years, as funding and researchers become increasingly available for safety work. We have also repeatedly seen the importance of foundational research arguments in discussions of policy, from Bostrom’s book to debates at OpenAI, Anthropic, and DeepMind. These connections may be more conceptual than direct, but they are still relevant.
Long timelines are possible
Second, timelines are uncertain. Many claim that timelines based on technical progress are short, leaving us years, not decades, until safety must be solved. But this assumes that policy and governance approaches fail, and that we therefore need a full technical solution in the short term. It also seems likely that short timelines make all approaches less likely to succeed. On the other hand, if timelines for technical progress are longer, fundamental advances in understanding, such as those provided by foundational research, are even more likely to assist in finding or building technical routes toward safer systems.
Aligning AGI ≠ aligning ASI
Third, even if safety research succeeds at “aligning” AGI systems via both policy and technical solutions, the challenges of ASI (Artificial SuperIntelligence) still loom large. One critical claim of AI-risk skeptics is that recursive self-improvement is speculative, so we do not need to worry about ASI, at least yet. They also often assume that policy and prosaic alignment are sufficient, or that approximate alignment of near-AGI systems will allow those systems to approximately align more powerful successors. Given any of those assumptions, they imagine a world where humans and AGI coexist, so that even if AGI captures an increasing fraction of economic value, it won’t be fundamentally uncontrollable. And even according to so-called Doomers, in that scenario it is likely that, for some period of time, policy changes, governance, limited AGI deployment, and human-in-the-loop oversight methods to limit or detect misalignment will be enough to keep AGI in check. This provides a stop-gap solution, optimistically for a decade or even two (a critical period), but it is insufficient later. And despite OpenAI’s recent announcement that they plan to solve Superalignment, there are strong arguments that control of strongly superhuman AI systems will not be amenable to prosaic alignment, and that policy-centric approaches will not allow control.
Resource allocation
Given the above claims, a final objection is based on resource allocation, in two parts. First, if language model safety were still strongly funding-constrained, that work would be higher leverage, and foundational and mathematical research would be a less beneficial marginal use of funds. Similarly, if the individuals likely to contribute to mathematical AI safety were all just as well suited to computational deep learning safety research, their skills might be better directed toward machine learning safety. Neither of these is the case.
Of course, investments in agent foundations research are unlikely to directly lead to safety within a few years, and it would be foolish to abandon or short-change the efforts that are critical to the coming decade. But even in the short term, these approaches may continue to have important indirect effects, including both deconfusion and informing other approaches.
As a final point: pessimistically, these types of research are among the least capabilities-relevant AI safety work being considered, so they are low-risk. Optimistically, this type of research is very useful in the intermediate-term future, and is invaluable should we manage to partially align language models and then need to consider what comes next for alignment.
Thank you to Vanessa Kosoy and Edo Arad for helpful suggestions and feedback. All errors are, of course, my own.