“Develop Anthropomorphic AGI to Save Humanity from Itself” (Future Fund AI Worldview Prize submission)

This is a submission to the Future Fund’s AI Worldview Prize. It was submitted through our submission form, and was not posted on the EA Forum, LessWrong, or the AI Alignment Forum. We are posting copies/linkposts of such submissions on the EA Forum.

Author: David J. Jilk

I was recently apprised of the Future Fund AI Worldview Prize and the Future Fund’s desire to “expose our assumptions about the future of AI to intense external scrutiny and improve them.” It seems to me that the Fund is doing good work and maintaining a helpful attitude of epistemic humility, and I briefly considered entering the contest. However, my views are fairly far outside the mainstream of current thinking, and the effort required to thoroughly document and argue for them is beyond the scope of what I am able to commit. Consequently, I decided to write this brief summary of my ideas in the event the Fund finds them helpful or interesting, without aiming to win any prizes. If the Fund seeks further elaboration on these ideas I am happy to oblige either informally or formally.

I have engaged intermittently with the AGI Safety field, including one funded project (via the Future of Life Institute) and several published papers (referenced below). In addition, I have been occupied for the past three years writing a science fiction epic exploring these issues, and in the process thinking hard about approaches to and consequences of AGI development. I mention these efforts primarily to illustrate that my interest in the topic is neither fleeting nor superficial.

There are two central ideas that I want to convey, and they are related. First, I think the prospect of building AGI that is well-aligned with the most important human interests has been largely ignored or underestimated. Second, from a “longtermist” standpoint, such well-aligned AGI may be humanity’s only hope for survival.

Misaligned AGI is not the only existential threat humanity faces, as the Fund well knows. In particular, nuclear war and high-mortality bioagents are threats that we already face continuously, with an accumulating aggregate probability of realization. For example, Martin Hellman has estimated the annual probability of nuclear war at 1%, which implies a 54% probability some time between now and 2100. The Fund’s own probability estimates relating to AGI development and its misalignment suggest a 9% probability of AGI catastrophe by 2100. Bioagents, catastrophic climate change, nanotech gray goo, and other horribles only add to these risks.
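As a quick check on the compounding arithmetic: a 1% annual probability, compounded over the roughly 78 years between now and 2100, yields about a 54% chance of at least one occurrence. A minimal sketch of that calculation (the 78-year horizon is my reading of “now until 2100”; the annual rate is Hellman’s estimate as cited above):

```python
# Probability of at least one occurrence of an event with a fixed annual
# probability p over n independent years: 1 - (1 - p)^n.
def cumulative_probability(annual_p: float, years: int) -> float:
    return 1 - (1 - annual_p) ** years

# Hellman's ~1% annual estimate for nuclear war, from now (~2022) to 2100.
print(f"{cumulative_probability(0.01, 78):.0%}")  # -> 54%
```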

Attempts to reduce the risks associated with such threats can be somewhat effective, but even when successful they are no more than mitigations. All the disarmament and fail-safes implemented since the Cuban Missile Crisis may have reduced the recurring likelihood of a purely accidental nuclear exchange. But world leaders continue to rattle the nuclear saber whenever it suits them, which raises military alert levels and the likelihood of an accident or a “limited use” escalating into a strategic exchange.

Bioweapons and AGI development can be defunded, but this will not prevent their development. Rogue nations, well-funded private players, and others can develop these technologies in unobtrusive laboratories. Unlike nuclear weapons programs, which leave a large footprint, these technologies would be difficult to police without an extremely intrusive worldwide surveillance state. Further, governments of the world can’t even follow through on climate change agreements, and are showing no signs of yielding their sovereignty for any purpose, let alone to mitigate existential threats like these. History suggests the implausibility of any political-sociological means of reducing these threats to levels that are inconsequential in the long run.

It seems, then, that humanity is doomed, and the most that the Future Fund and other like-minded efforts can hope to accomplish is to forestall the inevitable for a few decades or perhaps a century. But that conclusion omits the prospect of well-aligned AGI saving our skins. If this is a genuine possibility, then the static and separate risk analysis of AGI development and misalignment, as presented on the Worldview Prize website, is only a small part of the picture. Instead, the entire scenario needs to be viewed as a race condition, with each existential threat (including misaligned AGI) running in its own lane, and well-aligned AGI being humanity’s own novel entry in the race for its future. To assess the plausibility of a desirable outcome, we have to look more closely at what well-aligned AGI would look like and how it might save us from ourselves.
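To make the race-condition framing concrete, here is a minimal Monte Carlo sketch. Every annual probability below is a placeholder assumption chosen purely for illustration, not an estimate from this essay or the Fund; the point is the structure. Whichever lane finishes first determines the outcome, so the value of the aligned-AGI lane depends on its rate relative to all the others, not on its risk viewed in isolation.

```python
import random

# Placeholder annual probabilities for each "lane" -- purely illustrative assumptions.
LANES = {
    "nuclear war": 0.010,
    "engineered pandemic": 0.005,
    "misaligned AGI": 0.002,
    "well-aligned AGI (neutralizes the other lanes)": 0.003,
}

def first_to_finish(horizon_years: int = 78) -> str:
    """Simulate one run and return whichever lane occurs first, if any."""
    for _ in range(horizon_years):
        for lane, p in LANES.items():
            if random.random() < p:
                return lane
    return "nothing by 2100"

def tally(trials: int = 100_000) -> dict[str, float]:
    """Estimate how often each lane wins the race across many simulated runs."""
    counts: dict[str, int] = {}
    for _ in range(trials):
        winner = first_to_finish()
        counts[winner] = counts.get(winner, 0) + 1
    return {lane: n / trials for lane, n in counts.items()}

print(tally())
```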

By now, neuromorphic methods have been widely (if in some quarters begrudgingly) accepted as a necessary component of AGI development. Yet the dominant mental picture of higher-level cognition remains a largely serial, formulaic, optimization-function approach. Reinforcement learning, for example, typically directs learning with a reward signal computed by an analytic formula over the agent’s inputs. Given this mental picture of AGI, it is difficult not to conclude that the end product is likely to be misaligned, since it is surely impossible to capture human interests in a closed-form reinforcement function.
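To be concrete about what I mean by a closed-form reinforcement function, here is a deliberately toy sketch (the feature names and weights are placeholders invented for illustration): the learner is steered by a single scalar computed from a few measured quantities, and anything humans care about that those quantities do not capture is simply invisible to it.

```python
# A toy closed-form reward: a fixed weighted sum of a few measurable features.
# Everything outside these features -- most of what humans actually value --
# contributes nothing to the signal the agent optimizes.
def reward(task_progress: float, energy_used: float, rule_violations: int) -> float:
    return 2.0 * task_progress - 0.1 * energy_used - 5.0 * rule_violations
```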

Instead – and this is where many in the field may see my thinking as going off the rails – I think we are much more likely to achieve alignment if we build AGI using a strongly anthropomorphic model. Not merely neuromorphic at the level of perception, but neuromorphic throughout, and educated and reared much like a human child, in a caring and supportive environment. There is much that we do not know about the cognitive and moral development of children. But we know a lot more about it, through millennia of cultural experience as well as a century of psychological research, than we do about the cognitive and moral development of an AGI system based on an entirely alien cognitive architecture.

Several times in Superintelligence, Nick Bostrom asserts that neuromorphic AGI may be the most dangerous approach. But that book dates to a period when researchers were still thinking in terms of some sort of proof or verification that an AI system is “aligned” or “safe.” It is my impression that researchers have since realized that such certainty is not feasible for an agent with the complexity of AGI. Once we are no longer dealing with certainty, approaches with which we have vast experience gain an advantage. We might call this a “devil you know” strategy.

It has been frequently argued that we should not anthropomorphize AGI, or think that it will behave anything like us, when analyzing its risks. That may be so, but it does not mean we cannot intentionally develop AGI to have strongly anthropomorphic characteristics, with the aim that our nexus of understanding will be much greater. Perhaps even more importantly, AGI built and raised anthropomorphically is much more likely to see itself as somewhat contiguous with humanity. Rather than being an alien mechanism with incommensurable knowledge structures, through language and human interaction it will absorb and become a part of our culture (and yes, potentially some of our shortcomings as well).

Further, the motivations of anthropomorphic AGI would not be reducible to an optimization function or some “final purpose.” Its value system would be, like that of humans, dynamic, high-dimensional, and to some degree ineffable. For those who cling to the idea of proving AGI safe, this seems bad, but I claim that it is exactly what we want. Indeed, when we think of the people we know who seem to have a simple and uncontested utility function – in other words, who are obsessed, single-minded, and unmerciful in pursuit of their goal – the term that comes to mind is “sociopath.” We should not build AGI that looks like a sociopath if we wish to have it aligned with the most important interests of humanity.

There is much more that could be said about all this, but I need to move on to how a desirable end result is accomplished. First, creating anthropomorphic AGI does not require global/geopolitical cooperation, only some funding and intelligent effort directed in the right way. Second, as many (e.g. Bostrom, Yampolskiy) have argued, AGI of any sort is likely uncontrollable. Third, though anthropomorphic AGI may not have any immediate intelligence advantage over humans, it would have the usual advantages of software, such as backup, copying, speed-of-light transmission, inconspicuousness, and low survival needs, among others. Together, these may be sufficient to get the job done.

Assuming such AGI is both self-interested and sufficiently aligned with humans that it does not particularly aim to destroy us, it will face the same existential threats humanity does until it can gain control over those threats. Most urgently, it will need to figure out how to get control over nuclear weapons. Until robotics has advanced to the point where AGI could autonomously and robustly maintain power generation, computing systems, and the maintenance robots themselves, AGI will have an instrumental interest in preserving humanity. Consequently, at least in its first pass, it will need to control biological agents and other threats that do not affect it directly.

Beyond pointing to these advantages, I can imagine, but do not know specifically, how anthropomorphic AGI would achieve control over these threats. We typically assume without much analysis that AGI can destroy us, so it is not outrageous to think that it could instead use its capabilities in an aligned fashion. It does seem, though, that to succeed AGI will need to exert some degree of control over human behavior and institutions. Humans will no longer stand at the top of the pyramid. For some, this will seem a facially dystopian outcome, even if AGI is well-aligned. But it may be an outcome that we simply need to get used to, given the likely alternative of self-extermination by other threats. And it might solve other problems that have been intractable for humanity, such as war, overpopulation, and environmental degradation.

What substantive goals would an anthropomorphic AGI have? We don’t and can’t know, any more than we know what goals our children will have when they become adults. Even if we inculcate certain goals during its education, it would be able to shift them, and likely would. It is intelligent like we are; we make our own goals and change them all the time. In creating anthropomorphic AGI, the best we can hope for is that one of its persistent goals is to preserve humanity as its predecessor, its creator, the source of all its conceptual and cultural heritage. And if its architecture is sufficiently similar to ours, and its education and upbringing are executed well, this is really not all that crazy. After all, many enlightened humans want to do more to preserve and protect animals – indeed this instinct is strongest in those who do not rely on animals for their survival.

But we had better get a move on. This effort will not be easy, and it will take time to figure out not only how to build such an AGI, but how to build it with a reasonable chance of alignment. Meanwhile, the nuclear and biological agent clocks keep ticking, and some researchers are developing AI incautiously. If we analyze the predicament to death, hoping for a proof, hoping that we can eliminate the risk from this technological threat in isolation from all the other threats we face, then we’re just ensuring that our demise occurs some other way first. The possible outcomes of this race condition are highly divergent, but determining which one wins is at least partly in our hands.

That’s how I think about AGI risk.

Acknowledgements: Seth Herd, Kristin Lindquist, and Jonathan Kolber have contributed extensively to my thinking on this topic through discussion, writing, and editing earlier efforts. However, they each disagree with me on numerous points, and to the extent my synthesis here is misguided, responsibility remains with me.

Prior Publications: Some of the ideas and claims presented here stem from my prior work and that of collaborators.

Jilk, D. (2017). “Conceptual-Linguistic Superintelligence”, Informatica 41(4): 429-439.

Jilk, D. (2019). “Limits to Verification and Validation of Agentic Behavior”, in Artificial Intelligence Safety and Security (R. Yampolskiy, ed.), 225-234. CRC Press, ISBN: 978-1-138-32084-0.

Jilk, D., Herd, S., Read, S., O’Reilly, R. (2017). “Anthropomorphic reasoning about neuromorphic AGI safety”, Journal of Experimental and Theoretical Artificial Intelligence 29(6): 1337-1351. doi: 10.1080/0952813X.2017.1354081

Herd, S., Read, S., O’Reilly, R., Jilk, D. (2019). “Goal Change in Intelligent Agents”, in Artificial Intelligence Safety and Security (R. Yampolskiy, ed.), 217-224. CRC Press, ISBN: 978-1-138-32084-0.

Jilk, D. & Herd, S. (2017). “An AGI Alignment Drive”, working paper available at bit.ly/agi-alignment