AI safety and consciousness research: A brainstorm
I’ve been collecting people’s thoughts on the potential of consciousness research to advance AI safety. Here are some rough answers I’ve come across:
1. Understanding ethics / human values
These approaches argue that understanding consciousness could help us understand what we mean by a human-aligned AI (the value-loading problem).
However, this approach runs into problems once we start asking what exactly about our values we should specify, and what we should leave up to the superintelligence’s own closer examination. Bostrom lays out a solid argument for trusting an aligned AGI’s judgement over our own through the principle of epistemic deference: assuming an AGI has been programmed to reason as a perfectly open-minded and well-meaning human who has thought about moral questions longer and more deeply than the best philosophers, we should assume that in the case of a disagreement with such an AI, it’s more likely that the AI is right, as it can presumably encompass all of our own ideas and then some. This leads us to indirect normativity—the idea that, similarly to laws, the rules we encode in an aligned AI should be rather vague, so that it can correct for their problems upon closer examination.
These considerations suggest that advancing our understanding of ethics / human values to the literal molecular level wouldn’t really be helpful, as we should avoid locking in any of our specific present notions of values. Here are several answers to this argument:
Even if we accept that, in the case of a disagreement, the formulations of our values should remain open to revision, having a better model of human values might increase the chances we specify them correctly. For instance, the prompt we give to an AGI could involve this model as the thing we want it to extrapolate (because that’s “how far we’ve got on our own”). Indirect normativity posits that our prompt should be minimal, but perhaps we can form a superior minimal prompt based on a more advanced model of human values.
It seems intuitively likely that there’s a lower risk of mesa-optimisation misalignment if there are fewer cognitive steps between the values specified in the prompt and the values we would want if optimally extrapolated. For example, an AI optimizing for “the human concept of good” could simulate the extrapolated values of the average human and become a religious fundamentalist. However, an AI optimizing for “positive freedom to choose the best qualia” might be motivated to anchor its values in the best model of the interpretation of qualia & preferences it can come up with. [1]
It might be that we won’t be sure if we’ve solved the control problem. If there’s a tense race between an almost certainly aligned AI and an almost certainly misaligned AI, locking in slightly suboptimal values might be the better option. Additionally, if we are more certain about the validity of our value theories—or if we at least develop a better framework for researching the value dimension of consciousness—we are also better prepared for a situation where we would be forced to choose between several sub-optimal AIs.
The need for indirect normativity should probably be treated as a heuristic, rather than a logical law. It seems possible that research which would clarify how values are best represented might also find that the balance between not specifying anything and specifying too much doesn’t lie where we would intuitively think it does.
If we have a good model of what we care about, we have a mechanism for checking whether an AI is truly aligned, or whether it’s wrong or trying to manipulate us.
Counterargument: An AI that produces good answers to ethical questions is no guarantee of alignment. So avoiding a catastrophe means solving the control problem anyway, part of which is the ability to ask the AI to explain its reasoning so transparently that it’s clear to us it’s doing what we want.
What do we do if the AI comes up with something very counter-intuitive (e.g. tiling the universe with hedonium)? How do we check whether the AI extrapolated our values correctly if its philosophy seems impossible to comprehend? We either need to understand what exactly we mean by “correct extrapolation” or what exactly our values are. Since the thing we want to extrapolate is likely deeply connected to optimising the contents of our consciousness, it seems that consciousness research could be useful for both of these options.
The same problem applies if the AI comes up with something intuitive. We might think we’ve got an aligned AI, but it could just be that we’ve made an AI that replicates the mistakes of our moral intuitions. In other words, we could make a misaligned AI simply because we can’t see how it’s misaligned until we make progress in understanding human values.
If there’s a tense race between an almost certainly aligned AI and an almost certainly misaligned AI, we may not have enough time to try to integrate something very counter-intuitive into our definition of what the AI is supposed to achieve—whether by accepting it or explicitly forbidding it in the task prompt—unless there’s already a body of advanced consciousness research by the time AGI arrives.
Another possibility is that we develop an AI that’s almost certainly aligned but not transparent. A typical powerful AI we imagine is very general, so it may seem weird that it would be bad at “reading its own mind”, but it’s possible that analyzing an architecture such as a neural network takes more computing power than the network itself can provide. In that case, such an AI likely couldn’t even tell whether creating an analyzer AI powerful enough to analyze it would be safe. In this scenario, consciousness research would help as a check for alignment.
Consciousness or ethics could be an area where AI can’t make progress because it lacks some information we get from consciousness. However, one could object that it seems weird to think a superintelligence would just miss that consciousness is central to how we think about ethics. And if it didn’t miss that, it could employ humans to fill in the necessary missing bits. Some answers:
The fact that there seems to be an impenetrable barrier between the kind of information we can describe with physics and the kind that manifests in our conscious experience could lead to the counter-intuitive conclusion that even a superintelligence might miss something crucial about the human experience—since this information is qualitatively different from anything it can access, it might not even know it’s missing out on it.
In other words, Mary could be infinitely intelligent and still not get what we mean by red. What’s worse, her infinite intelligence could make her feel convinced there’s nothing to miss; it seems Mary would most naturally tend to think color itself is a meaningless concept. Likewise, the most natural philosophy for a consciousness-lacking AGI would seem to be illusionism, together with the related position that ethics is a meaningless concept.
One pathway towards AGI that currently seems quite likely is an AI that simulates a human (~an LLM). Sure, it’s possible that if a simulated human lacked inner experience, they would be able to report that. However, it’s hard to say, because there is no training data for this situation, as there don’t seem to be such humans. Everyone behaves the way a philosophical zombie would—with the exception of being interested in consciousness.[2] However, a well-simulated human would act as if they’re interested in consciousness and as if they understand it. This could lead to the AI latching onto a wrong proxy model of consciousness, such as “it’s when people report on their own neural algorithms”.
Improving (meta-)ethics might help create a better social environment for approaching AI safety with more reasonable assumptions.
It might be that actors, whether governments or AI developers, want to lock in certain values without realizing those values are instrumental rather than terminal. For instance, the imperative not to think in stereotypes that has been programmed into ChatGPT has had the unintended consequence that its reasoning about statistics seems contradictory. In the worst case, setting specific values in stone, instead of letting the AGI extrapolate the principles behind them, could be exactly what leads to optimising a wrong proxy value and thus to an x- or s-risk. This could be mitigated by improving meta-ethics, perhaps in a style similar to the work of Sharon H. Rawlette—by clarifying the delineation between biases and values. For instance, this could help some actors recognize biases in their naive moral intuitions that they might otherwise wish to lock in.
Advancing philosophy improves the training data. If reasonable ethical views become mainstream among philosophers, there’s a higher chance they get adopted by an AGI.
2. Advancing alignment research
Assuming humans are more or less aligned, understanding how we do it might be useful for AI alignment.[3]
Although this idea most naturally leads to studying human information processing without the need to see how it relates to qualia, I think there’s a good chance the consciousness frame can enrich these areas. For example, unless we understand consciousness, we might miss a crucial part of what “representations in the human brain / cognitive models” mean.
The alignment project could be framed as a race between the development of general AI capabilities and of capabilities useful for alignment, such as moral reasoning, where the phenomenological basis of human values could play a special role.
The PIBBSS framing: deconfusing ourselves about the basic philosophical underpinnings of intelligence, goals/motivation or cognitive processing might be a good way to find out what there is to think about.
This could involve consciousness, since we clearly seem to be especially confused about it: subjectively, it seems like a force that determines everything in the brain, and since we talk about it, we know it has causal properties. Yet from the outside view, it seems that physical phenomena can be predicted without understanding this apparent force.
Some people claim understanding consciousness would lead to a better understanding of seemingly chaotic behaviors of intelligent physical systems. A truly provably beneficial AI would require being able to predict the AI’s behavior down to the molecular level, and consciousness is a real phenomenon that physics can’t yet explain. This suggests current physics can’t guarantee that yet-unseen systems like an ASI wouldn’t display emergent phenomena that change the original physical architecture.
This is the approach QRI could advocate, suggesting that if we built a system which has experiences, it could adopt open individualism (the theory of self which encompasses everything conscious) and, as a result, be more likely to understand & value what we value.
Similar approaches require the belief that consciousness is a physical phenomenon with predictable causal power. In contrast, some people might argue that consciousness influences the world indeterministically through something akin to free will (inspired by Ansgar Kamratowski).
Counterargument: Theoretical indeterministic effects would by definition need to be impossible to predict, fulfilling the Bayesian definition of randomness. Their magnitude would probably be confined to quantum effects, and they would be just as likely to make a good AI go wrong as a bad AI go right. Random effects can be described as “statistically deterministic” and we can treat them as physical laws (a more detailed explanation is in this document). Nevertheless, the hypothesis that biological intelligence utilizes poorly understood physical laws could be an important consideration for alignment.
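To make the “statistically deterministic” point a bit more concrete, here is a minimal sketch, assuming (my illustrative assumption, not a claim from the linked document) that an indeterministic effect acts as an unpredictable perturbation symmetric around zero on top of deterministic dynamics:

$$x_{t+1} = f(x_t) + \epsilon_t, \qquad \mathbb{E}[\epsilon_t] = 0, \qquad P(\epsilon_t > 0) = P(\epsilon_t < 0)$$

Under these assumptions the perturbation carries no systematic bias in any direction, and the expected trajectory still follows the deterministic law, $\mathbb{E}[x_{t+1} \mid x_t] = f(x_t)$, which is one way to read “statistically deterministic”.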
3. Meta reasons
Both the field of consciousness and the field of AI safety are full of uncertainties and seem high-risk high-reward in nature. This means that even if smart people make good arguments against these reasons to pursue consciousness research, it might be beneficial to diversify our endeavors, as making sure we understand human values seems robustly good.
It’s possible that these uncertainties create biases even in our basic framing of the problem of “aligning AI with human values”. For instance, the possibility that identity, preferences or moral intuitions relate to some fundamental phenomena in consciousness could imply a different approach to programming the AI’s “constitution”[4].
Similarly, the possibility that there’s something fundamental about moral intuitions might require a better grasp of which elements of sentience give rise to a moral agent, i.e. whose moral intuitions we care about. Or perhaps, as some illusionists may suggest, our intuitions about what “perceiving X as valuable” means may be misguided.
Epistemic status: A question
I’ve been trying for a few years now to get a grasp on whether the possibility of making progress in consciousness research is under- or overrated, with no answer.
On one hand,
it’s a problem that has fascinated a lot of people for a long time
if a field is attractive, we can expect a lot of non-altruistically motivated people to work on it, more so if the field seems to grow (1; 2)
On the other hand,
neuroimaging has only just become a thing—do we expect it to solve a millennia-old problem right away?
I consider philosophy to be a task especially unsuitable for the human brain, so I wouldn’t defer to previous generations of philosophers, just like I wouldn’t defer to them on the ethics of slavery. The ideas of evolution, psychology or effective altruism emerged much later than they could have because, in my opinion, people underestimate how much their idea generation is confined to the borders of “what’s there to think about”. And the age of computers has opened up cognitive science as “a thing to think about” only quite recently by the standards of philosophy (the term “hard problem of consciousness” is only a few decades old).
if a field is growing, it could be an opportunity to help direct significant mental & economic potential into worthwhile efforts
and also an indication there is some progress to be made
There are definitely unproductive ways to research consciousness. Currently, the “pixels” of advanced functional neuroimaging each cover on the order of a million neurons. This leads to a lot of research about neural correlates concluding with fuzzy inferences like “the right hemisphere lights up more”. On the opposite side of the spectrum lie philosophical papers which try to explain consciousness in tautologies. I think the hard problem of consciousness is a very legitimate one, but one that dissolves into the unstudiable question of “why is there anything at all” once we embrace a frame like objective idealism and understand how precisely each quale corresponds to each computational phenomenon.
However, I also believe there are productive big questions like “How many senses are there & what exactly are they? (i.e. which phenomena do qualia reflect, and how do these phenomena feed into our information processing?) Can we rely on how they reflect what we value, or can our intuitions about what we value be wrong? Is there a reliable description of value, such as intensity times valence? Or does the perceived value of an experience depend on integrating information across time and modalities, or on emotional richness? What is a net-positive experience? Which cognitive phenomena allow us to access the information that we are conscious?”—which seem fundamental to prioritization/ethics, mental health and perhaps the universe.
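To make the “intensity times valence” question concrete, here is one naive way such a description could be formalized (the symbols $I$, $v$ and $T$ are illustrative placeholders of mine, not part of any established theory):

$$V(e) = \int_0^T I(t)\, v(t)\, dt$$

where $I(t) \ge 0$ is the intensity of experience $e$ at time $t$ and $v(t) \in [-1, 1]$ its valence. The open question is whether anything like this separable, pointwise form can be reliable, or whether $V$ also depends on how information is integrated across time, modalities and emotional richness, in which case no simple product would capture it.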
Related articles
Kaj Sotala: Cognitive Science/Psychology As a Neglected Approach to AI Safety
Cameron Berg: Theoretical Neuroscience For Alignment Theory
Paul Christiano: The easy goal inference problem is still hard
Rohin Shah: Value Learning sequence
Special thanks to Jan Votava, Max Räuker, Andrés Gómez Emilsson, Aatu Koskensilta and an anonymous person for their inspiration & notes!
- ^
What I propose here is that the risk of inner misalignment is decreased if we have a good idea of the values we want to specify (outer alignment), because it reduces the danger of misinterpreting the values (reward function) we specified. The non-triviality of this problem is nicely explained in the “Morality models” chapter of Bostrom’s Superintelligence.
- ^
This could be a test for digital consciousness—if we managed to somehow delete the concept of consciousness from the training data, would it naturally re-emerge, just as it has emerged in various cultures?
- ^
By “more or less aligned” I mean something like: the behavior of some humans is guided by moral principles enough that an AI which simulated their coherent extrapolated volition would seek behavior corresponding to the most honest, well-thought-out and well-meaning interpretation of ethics it can come up with.
- ^
By “constitution” I mean “an algorithm determining how to determine what’s morally correct in a given moment”, a concept linked closely to meta-ethics. Under some views, this would also involve “rules that lay out the relationship between AI and other potentially conscious entities”.
Update: I’m pleased to learn that Yudkowsky seems to have suggested a similar agenda in a recent interview with Dwarkesh Patel (timestamp) as his greatest source of predictable hope about AI. It’s a rather fragmented bit, but the gist is: perhaps people doing RLHF could get a better grasp on what to aim for by studying where “niceness” comes from in humans. He’s inspired by the idea that “consciousness is when the mask eats the shoggoth” and suggests, “maybe with the right bootstrapping you can let that happen on purpose”.
I see a very important point here: human intelligence isn’t misaligned with evolution in a random direction; it is misaligned in the direction of maximizing positive qualia. Therefore, it seems very likely that consciousness played a causal role in the evolution of human moral alignment—and such a causal role should be possible to study.