This is a Draft Amnesty Week draft. It may not be polished, up to my usual standards, fully thought through, or fully fact-checked.
TLDR: I made a back-of-the-envelope model for the value of steering the future of AI (link here).
I started with four questions:
a) How morally aligned can we expect the goals of an ASI to be?
b) How morally aligned can we expect future human goals to be?
c) How much can we expect ASI to increase or decrease human agency?
d) How would a [stronger AIS movement] affect these expectations?
Here, by agency, I mean the proportion of decisions made based on someone’s expressed preferences. In my model, I compare a world in which a superintelligence (ASI) suddenly arises (World A) with a world in which there is a boom of AI safety research (World B) or one possible goal of AI governance is achieved (World C). More in the doc.
Although Bostrom (1, 2), Ord (Precipice, Chapter 1) or MacAskill (WWOTF) tackle all of these questions, I'm not aware of a post that puts them into a single "interactive" Excel, so that's what I tried to do. Seeing how they weigh against each other in my mind makes me think consciousness and the reliability of progress are somewhat under-discussed parts of the equation.
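For readers who prefer code to spreadsheets, here is a minimal sketch of the kind of structure such a model can have. Everything in it (the function name, the -1 to 1 value scale, and the placeholder probabilities) is my own illustrative assumption, not the parameters or numbers from the linked doc.

```python
# Minimal sketch of a back-of-the-envelope model comparing World A and World B.
# All parameter names and numbers are illustrative placeholders.

def expected_future_value(p_asi_aligned, p_human_aligned, human_agency):
    """Crude expected value of the future on an arbitrary -1..1 scale.

    human_agency: share of decisions made according to humans' expressed
    preferences (question c); the remainder is decided by the ASI.
    """
    human_part = human_agency * (2 * p_human_aligned - 1)      # question b
    asi_part = (1 - human_agency) * (2 * p_asi_aligned - 1)    # question a
    return human_part + asi_part

# World A: an ASI arises suddenly, without a stronger AIS movement.
world_a = expected_future_value(p_asi_aligned=0.2, p_human_aligned=0.6, human_agency=0.1)

# World B: a boom of AI safety research shifts the parameters (question d).
world_b = expected_future_value(p_asi_aligned=0.5, p_human_aligned=0.6, human_agency=0.5)

print(f"World A: {world_a:+.2f}, World B: {world_b:+.2f}, value of steering: {world_b - world_a:+.2f}")
```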
I was influenced by Joscha Bach's arguments from this debate, which rest on:
a strong credence humanity will go extinct without AGI (based on the Limits to Growth reports)
a theory of valence based on predictive processing (which would presumably make a future shaped by a random AI value function likely good)
The extinction-without-AGI angle rests on resource depletion, which I explored in this post without finding a credible basis for Bach's argument. However, I think it's reasonable to question the value of x-risk reduction if one is uncertain that civilization without AGI could yield much positive value. Similarly, I think Bach's specific theory of valence is likely wrong, but grant that the hypothesized conclusion should be taken seriously based on a wider range of views on consciousness and AI.
As a result, my guess is that whether or not AI safety succeeds at steering the values of ASI, the future will be better than today. However, these considerations haven't changed my general outlook: it's much more likely that the future will be good if humanity makes a conscious effort to shape the trajectory and values of ASI, and this conclusion seems robust even to quite exotic considerations.
Nevertheless, my reflection highlighted a few ideas:
1. Alignment isn’t just about the control problem
Yes, AIS increases human agency (question c), but it also increases the probability that the amount of agency given to humans or ASI will depend on their moral alignment (interactions c-a and c-b), and it directly improves the probability that any AI that gets developed will be morally aligned (question a). To a limited extent, AIS may also improve human (moral) decision-making (question b) via the routes discussed within AI ethics (such as preventing the rise of extremism via AI manipulation).
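To make the interaction terms a bit more concrete, here is a toy Monte Carlo extension of the earlier sketch. As before, the function names, probabilities, and the ±1 payoffs are my own placeholder assumptions, not parameters from the doc.

```python
import random

# Toy sketch of interactions c-a and c-b: with probability
# p_agency_tracks_alignment, agency ends up with whichever party (humans or
# the ASI) is actually morally aligned; otherwise the default split applies.
# All names and numbers are illustrative placeholders.

def sample_future_value(p_asi_aligned, p_human_aligned, default_human_agency,
                        p_agency_tracks_alignment):
    asi_aligned = random.random() < p_asi_aligned
    humans_aligned = random.random() < p_human_aligned
    if asi_aligned != humans_aligned and random.random() < p_agency_tracks_alignment:
        human_agency = 1.0 if humans_aligned else 0.0
    else:
        human_agency = default_human_agency
    return (human_agency * (1 if humans_aligned else -1)
            + (1 - human_agency) * (1 if asi_aligned else -1))

# Monte Carlo estimate: raising p_agency_tracks_alignment (what AIS does via
# the c-a and c-b interactions) improves the expected value of the future.
runs = 100_000
for p_track in (0.1, 0.6):
    avg = sum(sample_future_value(0.3, 0.6, 0.2, p_track) for _ in range(runs)) / runs
    print(f"p_agency_tracks_alignment={p_track}: average value {avg:+.3f}")
```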
2. Increasing human agency does not guarantee positive outcomes.
It seems a truly long-lasting value lock-in is only possible with heavy help from AI. Therefore, the risk that we would solve the alignment problem but nevertheless irrationally prevent ourselves from building a friendly AI seems very low—relative to the billions of years we've got to realize our potential, cultural evolution is quick. More on this in point 5.
This consideration also potentially suggests that one risk of increasing humanity's attempts to carefully shape AI values could be increasing the chances of a value lock-in. However, I think that if we solve the control problem (i.e. humans stay in the decision loop), an AI capable of a value lock-in would understand how our meta-values interact with our true values. In other words, coherent extrapolated volition is a more rational way of interpreting goals than taking them literally, so I have decent faith an aligned ASI would recognize that. And it doesn't seem like there are important differences in CEV, that is, in meta-values (more in point 6).
I think there's a big chance I'm wrong here. If ASI arises by scaling an LLM, it could be analogous to a human who is very smart in terms of System 1 (can instantly produce complex plans to achieve goals) but not so rational, i.e. not so bright in terms of System 2 (doesn't care to analyze how philosophically coherent these goals are). However, these scenarios seem like precisely the kind of problem that increased attention to AI safety reduces.
3. Consciousness, progress and uncertainty seem like key factors.
Understanding consciousness seems important for evaluating what value we would lose if an AI proceeded to convert the universe's resources according to whatever value function happened to win the AI race. I explored this interaction more in a previous post.
Understanding progress seems important for evaluating whether humanity would be better equipped to create an ASI in 100 or 1,000 years. For this purpose, I think "better equipped" can be nicely operationalized in a very value-uncertain way as "making decisions based on more reflection & evidence and higher-order considerations". Part of this question is whether morally misaligned actors, such as authoritarian regimes or terrorists, might use this time to catch up and perhaps use an AI to halt humanity's potential (5).
The specific flavor of uncertainty we choose seems crucial. If it pushes us towards common-sense morality, or if it pushes us to defer to later generations, AIS seems like a clear top priority. If it pushes us towards views that assign moral patienthood to AI, it may decrease the case for some forms of AIS (such as an indefinite pause) while increasing it for others (e.g. implementing reliable AI philosophy / meta-cognition, see Chi's recent post) (6).
4. Increasing ASI agency does not guarantee negative outcomes.
The orthogonality thesis, as proposed by Bostrom, is hard to disagree with—it does seem possible to imagine an AI holding any combination of goals and intelligence. However, the thesis alone doesn't rule out a possible correlation—i.e. the possibility that an AI with somewhat flexible goals is more likely to end up morally aligned than misaligned.
Given the grand uncertainty and importance of these questions, hoping that such a correlation exists would be a terrible plan. Nevertheless, there are a few interesting reasons one might think it does:
Humans act as an existence proof that alignment with morality "by default" is possible (or even likely on priors): Batson suggests people treat others' wellbeing as an intrinsic value (i.e. true altruism exists), which is why I suspect the CEV of most of humanity would converge on a world model close to the moral ideal. However:
This approach could be biased by anthropic effects—if we hadn’t developed morality, we wouldn’t be talking about it.
Some suggest RLHF could be analogous to this process; most disagree.
It could be that positive value means the fulfillment of preferences. In this way, virtually any ASI capable of having coherent preferences may be maximizing moral value by realizing them.
If an AI starts to reflect on what it should aim to achieve, it may have to solve what “it” is, i.e. the philosophy of self. It may conclude (personal) identity is an unsustainable concept and accept open individualism or a kind of veil of ignorance—if you don’t know in which intelligent entity you will be the next moment, you should optimize for everyone’s well-being.
Consciousness may shape the architecture of intelligent networks. Or, vice versa, intelligent networks may naturally benefit from creating positive qualia.
5. Progress with humans in charge seems reliable
If it's true that EAs are the WEIRDest of the WEIRD (Western, Educated, Industrialized, Rich, Democratic), effective altruism seems to be contingent on the natural arrow of progress. It seems Hegel was right: in the long term, any value dissatisfaction creates tension, and therefore systems positive for human well-being seem more stable.
Most notably, democracies seem more stable than autocracies. The typical story of both right and left authoritarian regimes of the 20th century seems to be a spontaneous collapse—or (in the case of China or Vietnam) adaptation to become more tolerable. The spirit of democracy seems so omnipresent that existing authoritarian regimes generally pay lip service to it and seem pressured to accommodate the opinions of their populations. In China, around 90% of people support democracy; in Arab countries, this figure reaches around 72%.
One could fear that the higher birth rates among religious fundamentalists relative to the cosmopolitan population could make us expect the future to have less rational values. To evaluate this hypothesis, one could inspect demographic projections of religiosity as a crude heuristic. Indeed, a look at the global projection for 2050 shows a 3% decline in irreligion. However, I suspect that as the demographic transition unfolds and people become richer, religious practice will become more reminiscent of the rich parts of the Arab world. Eventually, I think we should expect these regions to follow the current demographic trends in the US, where irreligion is on the rise. Here, my point isn't to argue that these specific trends are necessarily optimistic, but rather that in rich societies, horizontal memetic cultural evolution (ideas spreading) seems faster than the vertical kind (ideas being "inherited").
One could fear that populism will get more intense with AI, leading to worse governance. I think this is a problem we should take seriously. Nevertheless, I again think the evidence leans towards optimism. Firstly, AI may also improve social media's defenses against fake news. Secondly, new populism does not seem dependent on false information per se, but rather on misleading interpretations of reality. Being 1 SD more exposed to fake news only increases populist voting by 0.19 SD. Similarly, conspiracy beliefs don't seem to have changed much over recent years. And importantly, our trust in video evidence seems to be adjusting to the decreasing cost of creating deepfakes. Deepfakes that are practically impossible to recognize are already very easy to make; nevertheless, none of the attempts to sway wars and elections this way seem to have made a significant difference so far.
Let's say humans won't become an interplanetary species. In such a case, I'd expect our species to continue thriving on this planet for the remaining lifetime of Earth, i.e. something like 500 million years. Let's say current AI safety efforts do overshoot and, as a result, our civilization implements a tough international law that prevents it from making use of the positive side of AI and spreading between the stars. This could constitute a suboptimal lock-in. However, it seems unlikely to me that without AI, humans would be able to lock in a bad idea for long enough to matter. In the 17th century, slavery and witch trials were commonly accepted. If it took us a hundred times longer to reach some moral threshold (roughly 30,000 years instead of ~300), we would have used up just 0.006% of the remaining lifetime of our planet. In the optimistic scenario where we utilize the full lifetime of the universe, the available time could be a trillion times longer.
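For transparency, here is my reconstruction of that arithmetic; the ~300-year baseline is my assumption for how long it took to abandon those practices, since the post itself only gives the 100x multiplier and the 0.006% result.

```python
# Reconstruction of the 0.006% figure (assumed baseline: ~300 years to abandon
# 17th-century slavery and witch trials; a lock-in makes that 100x slower).
years_to_correct = 300 * 100           # 30,000 years
earth_remaining_years = 500_000_000    # ~500 million years of habitability
print(f"{years_to_correct / earth_remaining_years:.3%}")  # -> 0.006%
```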
6. “Indiscriminate moral uncertainty” supports AIS
Naively, absolute moral uncertainty would imply practical moral nihilism. Every moral claim would have a 50% probability of being true, so there would be no reason to judge actions on moral grounds. However, such a position requires ~100% credence that for each claim, this probability is indeed 0.5 and that no further inspection can move it by any margin, which is paradoxically an expression of ridiculous certainty (see the toy calculation at the end of this point). True moral uncertainty probably leads to attempts to increase humanity's philosophical reflection. This seems philosophically very straightforward:
Yes, humans disagree about values to such an extent that most charitable attempts get assigned negative value from some perspective. However, compared to our value disagreements, our meta-value disagreements seem incredibly small—nearly everyone wants to choose their beliefs according to what is true and what brings fulfillment.
It seems hard to argue that more reflection gets us farther from the truth. And it seems hard to argue that knowing the truth brings us less of what we meta-value. Therefore, steering progress towards reflection seems like a robust way to increase the fulfillment of humanity’s meta-values.
AIS could be a necessary precursor to making sure we have time for such reflection. This is a less "obviously true" statement, but the uncertainty here is epistemic, not moral. And provided ASI doesn't happen in our lifetimes, such an effort would merely be a waste, not actively harmful, which seems positive from the position of a "sincere" moral uncertainty.
Lastly, more uncertainty about cause X increases the necessity of developing an (aligned) ASI. For instance, one could argue that perhaps the universe is full of deadly rays that wipe life out the moment they meet it, but we can't observe any signs of them, because once we could observe them, we'd already be dead. However, I think the Grabby Aliens model provides an interesting argument against this reasoning—just based on conventional assumptions about the great filters, our civilization is suspiciously early in the universe (see this fun animated explainer). Therefore, any additional strong historical selection effect seems unlikely on priors.
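Coming back to the nihilism argument at the start of this point, here is a toy expected-value calculation of why genuine uncertainty favors reflection rather than nihilism. The 10% chance of resolution and the ±1 payoffs are made-up illustrative numbers.

```python
# Toy value-of-reflection calculation. Acting on a 50/50 moral claim at random
# has expected moral value 0. Reflection only fails to beat that if you are
# ~certain it can never move your credences.
p_reflection_resolves = 0.10   # assumed chance reflection settles the claim
value_if_resolved = 1.0        # you then act on the true claim
value_if_unresolved = 0.0      # otherwise you are back to acting at random

ev_reflection = (p_reflection_resolves * value_if_resolved
                 + (1 - p_reflection_resolves) * value_if_unresolved)
print(ev_reflection)  # 0.1 > 0: any nonzero chance of resolution makes
                      # reflection better than practical nihilism
```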