Thanks for the write-up. I’m excited about people presenting well thought-through cases for the value of different domains.
I want to push back a bit against the claim that the problem is time-sensitive. If we needed to directly specify what we valued to a powerful AI, then it would be crucial that we had a good answer to that by the time we had such an AI. But an alternative to directly specifying what it is that we value is to specify the process for working out what to value (something in the direction of CEV). If we can do this, then we can pass the intellectual work of this research off to the hypothesised AI. And this strategy looks generally very desirable for various robustness reasons.
Putting this together, I think that there is a high probability that consciousness research is not time-critical. This is enough to make me discount its value by perhaps one-to-two orders of magnitude. However, it could remain high-value even given such a discount.
(I agree that in the long run it’s important. I haven’t looked into your work beyond this post, so I don’t (yet) have much of a direct view of how tractable the problem is to your approach. At least I don’t see problems in principle.)
Thanks for the comment! I think the time-sensitivity of this research is an important claim, as you say.
My impression of how MIRI currently views CEV is that it’s ‘a useful intuition pump, but not something we should currently plan to depend on for heavy lifting’. In the last MIRI AMA, Rob noted that
I discussed CEV some in this answer. I think the status is about the same: sounds like a vaguely plausible informal goal to shoot for in the very long run, but also very difficult to implement. As Eliezer notes in https://arbital.com/p/cev/, “CEV is rather complicated and meta and hence not intended as something you’d do with the first AI you ever tried to build.” The first AGI systems people develop should probably have much more limited capabilities and much more modest goals, to reduce the probability of catastrophic accidents.
As an intuition pump, rough sketch, or placeholder, I really like CEV. What I’m worried about is that discussion of CEV generally happens in “far mode”, and there’s very probably work that could and should be done now in order to evaluate how plausible CEV is, and explore alternatives. Four reasons not to depend too much on CEV alone:
CEV is really hard. This seems consistent with what Rob & Eliezer have said.
CEV may not be plausible. A failure mode acknowledged in the original document is that preferences may never cohere—but I would add that CEV may simply be too underdefined & ambiguous to be useful in many cases. E.g., a “preference” is sometimes a rather leaky abstraction to begin with. A lot of possibilities look reasonable from far away, but not from up close, and CEV might be one of these.
CEV may give bad answers. It seems entirely possible that any specific implementation of CEV would unavoidably include certain undesirable systemic biases. More troublingly, maybe preference utilitarianism is just a bad way to go about ethics (I think this is true, personally).
Research into qualia may help us get CEV right. If we define the landscape of consciousness as the landscape within which morally-significant things happen, then understanding this landscape better should help us see how CEV could - or couldn’t - help us navigate it.
Aside from these CEV-specific concerns, I think research into consciousness & valence could have larger benefits to AI safety - I wrote up some thoughts on this last year at http://opentheory.net/2015/09/fai_and_valence/ .
Rather than time-sensitivity, another way to frame this could be path-dependence based on order of technological development. Do we get better average & median futures if we attempt to build AI without worrying much about qualia, or if we work on both at once?
(Granted, even if this research is all I say it is, there are potential pitfalls of technological development down this path.)
Act-based agents, which defer to humans to a large extent. The goal is to keep humans in control of the future.
Task AI, which is used to accomplish concrete objectives in the world. The idea would be to use this to accomplish goals people would want accomplished using AI (including reducing existential risk), while leaving the future moral trajectory in the hands of humans.
Both proposals end up deferring to humans to decide the long-run trajectory of humanity. IMO, this isn’t a coincidence; I don’t think it’s likely that we get a good outcome without deferring to humans in the long run.
Some more specific comments:
If pleasure/happiness is an important core part of what humanity values, or should value, having the exact information-theoretic definition of it on-hand could directly and drastically simplify the problems of what to maximize, and how to load this value into an AGI
There’s one story where this makes a little bit of sense, where we basically give up on satisfying any human values other than hedonic values, and build an AI that maximizes pleasure without satisfying any other human values. I’m skeptical that this is any easier than solving the full value alignment problem, but even if it were, I think this would be undesirable to the vast majority of humans, and so we would collectively be better off coordinating around a higher target.
If we’re shooting for a higher target, then we have some story for why we get more values than just hedonic values. E.g. the AI defers to human moral philosophers on some issues. But this method should also succeed for loading hedonic values. So there isn’t a significant benefit to having hedonic values specified ahead of time.
Even if pleasure isn’t a core terminal value for humans, it could still be used as a useful indirect heuristic for detecting value destruction. I.e., if we’re considering having an AGI carry out some intervention, we could ask it what the expected effect is on whatever pattern precisely corresponds to pleasure/happiness.
This seems to be in the same reference class as asking questions like “how many humans exist” or “what’s the closing price of the Dow Jones”. I.e. you can use it to check if things are going as expected, though the metric can be manipulated. Personally I’m pessimistic about such sanity checks in general, and even if I were optimistic about them, I would think that the marginal value of one additional sanity check is low.
There’s going to be a lot of experimentation involving intelligent systems, and although many of these systems won’t be “sentient” in the way humans are, some system types will approach or even surpass human capacity for suffering.
See Eliezer’s thoughts on mindcrime. Also see the discussion in the comments. It does seem like consciousness research could help for defining a nonpersonhood predicate.
I don’t have comments on cognitive enhancement since it’s not my specialty.
Some of the points (6, 7, 8) seem most relevant if we expect AGI to be designed to use internal reinforcement substantially similar to humans’ internal reinforcement and substantially different from modern reinforcement learning. I don’t have precise enough models of such AGI systems to feel optimistic about doing research related to them, but if you think questions like “how would we incentivize neuromorphic AI systems to do what we want” are tractable then maybe it makes sense for you to do research on this question. I’m pessimistic about things in the reference class of IIT making any progress on this question, but maybe you have different models here.
I agree that “Valence research could change the social and political landscape AGI research occurs in” and, like you, I think the sign is unclear.
(I am a MIRI research fellow but am currently speaking for myself not my employer).
Thanks for the thoughtful note. I do want to be very clear that I’m not criticizing MIRI’s work on CEV, which I do like very much! It seems like the best intuition pump & Schelling point in its area, and I think it has potential to be more.
My core offering in this space (where I expect most of the value to be) is Principia Qualia- it’s more up-to-date and comprehensive than the blog post you’re referencing. I pose some hypotheticals in the blog post, but it isn’t intended to stand alone as a substantive work (whereas PQ is).
But I had some thoughts in response to your response on valence + AI safety:
->1. First, I agree that leaving our future moral trajectory in the hands of humans is a great thing. I’m definitely not advocating anything else.
->2. But I would push back on whether our current ethical theories are very good- i.e., good enough to see us through any future AGI transition without needlessly risking substantial amounts of value.
To give one example: currently, some people make the claim that animals such as cows are much more capable of suffering than humans, because they don’t have much intellect to blunt their raw, emotional feeling. Other people make the claim that cows are much less capable of suffering than humans, because they don’t have the ‘bootstrapping strange loop’ mind architecture enabled by language, and necessary for consciousness. Worryingly, both of these arguments seem plausible, with no good way to pick between them.
Now, I don’t think cows are in a strange quantum superposition of both suffering and not suffering— I think there’s a fact of the matter, though we clearly don’t know it.
This example may have moral implications, but little relevance to existential risk. However, when we start talking about mind simulations and ‘thought crime’, WBE, selfish replicators, and other sorts of tradeoffs where there might be unknown unknowns with respect to moral value, it seems clear to me that these issues will rapidly become much more pressing. So, I absolutely believe work on these topics is important, and quite possibly a matter of survival. (And I think it’s tractable, based on work already done.)
Based on my understanding, I don’t think Act-based agents or Task AI would help resolve these questions by default, although as tools they could probably help.
->3. I also think theories in IIT’s reference class won’t be correct, but I suspect I define the reference class much differently. :) Based on my categorization, I would object to lumping my theory into IIT’s reference class (we could talk more about this if you’d like).
->4. Re: suffering computations- a big, interesting question here is whether moral value should be defined at the physical or computational level. I.e., “is moral value made out of quarks or bits (or something else)?” — this may be the crux of our disagreement, since I’m a physicalist and I gather you’re a computationalist. But PQ’s framework allows for bits to be “where the magic happens”, as long as certain conditions obtain.
One factor that bears mentioning is whether an AGI’s ontology & theory of ethics might be path-dependent upon its creators’ metaphysics in such a way that it would be difficult for it to update if it’s wrong. If this is a plausible concern, this would imply a time-sensitive factor in resolving the philosophical confusion around consciousness, valence, moral value, etc.
->5. I wouldn’t advocate strictly hedonic values (this was ambiguous in the blog post but is clearer in Principia Qualia).
->6. However, I do think that “how much horrific suffering is there in possible world X?” is a hands-down, qualitatively better proxy for whether it’s a desirable future than “what is the Dow Jones closing price in possible world X?”
->7. Re: neuromorphic AIs: I think an interesting angle here is, “how does boredom stop humans from wireheading on pleasurable stimuli?”—I view boredom as a sophisticated anti-wireheading technology. It seems possible (although I can’t vouch for plausible yet) that if we understand the precise mechanism by which boredom is implemented in human brains, it may help us understand and/or control neuromorphic AGIs better. But this is very speculative, and undeveloped.
->3. I also think theories in IIT’s reference class won’t be correct, but I suspect I define the reference class much differently. :) Based on my categorization, I would object to lumping my theory into IIT’s reference class (we could talk more about this if you’d like).
I’m curious about this, since you mentioned fixing IIT’s flaws. I came to the comments to make the same complaint you were responding to Jessica about.
I had the same response. The document claims that pleasure or positive valence corresponds to symmetry.
What people generally refer to when they speak of ‘happiness’ or ‘suffering’ - the morally significant hedonic status of a system - is the product of valence × intensity × consciousness, or the location within this combined state-space.
This does not look like a metric that is tightly connected to sensory, cognitive, or behavioral features. In particular, it is not specifically connected to liking, wanting, aversion, and so forth. So, like IIT in the cases discussed by Scott Aaronson, it would seem likely to assign huge values (of valence rather than consciousness, in this case) to systems that lack the corresponding functions, and very low values to systems that possess them.
The document is explicit about qualia not being strictly linked to the computational and behavioral functions that lead us to, e.g. talk about qualia or withdraw from painful stimuli:
In short, our brain has evolved to be able to fairly accurately report its internal computational states (since it was adaptive to be able to coordinate such states with others), and these computational states are highly correlated with the microphysical states of the substrate the brain’s computations run on (the actual source of qualia). However, these computational states and microphysical states are not identical. Thus, we would need to be open to the possibility that certain interventions could cause a change in a system’s physical substrate (which generates its qualia) without causing a change in its computational level (which generates its qualia reports). We’ve evolved toward having our qualia, and our reports about our qualia, being synchronized- but in contexts where there hasn’t been an adaptive pressure to accurately report our qualia, we shouldn’t expect these to be synchronized ‘for free’.
The falsifiable predictions are mostly claims that the computational functions will be (imperfectly) correlated with symmetry, but the treatment of boredom appears to allow that these will be quite imperfect:
Why do we find pure order & symmetry boring, and not particularly beautiful? I posit boredom is a very sophisticated “anti-wireheading” technology which prevents the symmetry/pleasure attractor basin from being too ‘sticky’, and may be activated by an especially low rate of Reward Prediction Errors (RPEs). Musical features which add mathematical variations or imperfections to the structure of music—e.g., syncopated rhythms (Witek et al. 2014), vocal burrs, etc—seem to make music more addictive and allows us to find long-term pleasure in listening to it, by hacking the mechanic(s) by which the brain implements boredom.
Overall, this seems systematically analogous to IIT in its flaws. If one wanted to pursue an analogy to Aaronson’s discussion of trivial expander graphs producing extreme super-consciousness, one could create an RL agent (perhaps in an artificial environment where it has the power to smile, seek out rewards, avoid injuries (which trigger negative reward), favor injured limbs, and consume painkillers (which stop injuries from generating negative reward)) whose symmetry could be measured in whatever way the author would like to specify.
I think we can say now that we could program the agent in such a way that it sought out things that resulted in either more or less symmetric states, or was neutral to such things. Likewise, switching the signs of rewards would not reliably switch the associated symmetry. And its symmetry could be directly and greatly altered without systematic matching behavioral changes.
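As a rough sketch of the kind of construction being described (everything here is a hypothetical toy, including the placeholder symmetry metric; it is not the measure PQ proposes):

```python
import random

def symmetry(state):
    # Placeholder metric: fraction of positions where the bit-vector matches
    # its mirror image. Purely illustrative; not the measure PQ proposes.
    n = len(state)
    return sum(state[i] == state[n - 1 - i] for i in range(n)) / n

def reward(state, sign):
    # The designer is free to tie reward to symmetry with either sign,
    # or to ignore symmetry entirely.
    return sign * symmetry(state)

def greedy_step(state, sign):
    # One-step greedy agent: flip whichever bit most improves its reward.
    def flipped(i):
        return state[:i] + [1 - state[i]] + state[i + 1:]
    best = max(range(len(state)), key=lambda i: reward(flipped(i), sign))
    return flipped(best)

start = [random.randint(0, 1) for _ in range(8)]
for sign in (+1, -1):
    s = list(start)
    for _ in range(20):
        s = greedy_step(s, sign)
    print(sign, round(symmetry(s), 2))  # same machinery, opposite symmetry outcomes
```

Nothing in this toy forces the sign of the reward and the symmetry of the visited states to line up, which is the decoupling at issue here.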
I would like to know whether the theory in PQ is supposed to predict that such agents couldn’t be built without extraordinary efforts, or that they would have systematic mismatch of their functional beliefs and behavior regarding qualia with actual qualia.
Hi Carl, thanks for your thoughts & time. I appreciate the comments.
First, to be clear, the hypothesis is that the symmetry of the mathematical object isomorphic to a conscious experience corresponds to valence. This is distinct from (although related to) the symmetry of a stimulus, or even symmetry within brain networks.
This does not look like a metric that is tightly connected to sensory, cognitive, or behavioral features. In particular, it is not specifically connected to liking, wanting, aversion, and so forth. So, like IIT in the cases discussed by Scott Aaronson, it would seem likely to assign huge values (of valence rather than consciousness, in this case) to systems that lack the corresponding functions, and very low values to systems that possess them.
I strongly disagree with this in the case of humans, fairly strongly disagree in the more general case of evolved systems, and mildly disagree in the fully general case of arbitrary systems.
First, it seems extremely likely to me that evolved organisms would use symmetry as an organizational principle / attractor (Section XII);
Second, in cases where we do have some relevant data or plausible models (I.e., as noted in Sections IX and XII), the symmetry hypothesis seems plausible. I think the hypothesis does really well when one actually looks at the object-level, particularly e.g., Safron’s model of orgasm & Seth and Friston’s model of interoception;
Third, with respect to extending Aaronson’s critique, I question whether “this seems to give weird results when put in novel contexts” is a good path to take. As Eric Schwitzgebel notes, “Common sense is incoherent in matters of metaphysics. There’s no way to develop an ambitious, broad-ranging, self-consistent metaphysical system without doing serious violence to common sense somewhere. It’s just impossible. Since common sense is an inconsistent system, you can’t respect it all. Every metaphysician will have to violate it somewhere.” This seems particularly true in the realm of consciousness, and particularly true in contexts where there was no evolutionary benefit in having correct intuitions.
As such it seems important not to enshrine common sense, with all its inconsistencies, as the gold standard with regard to valence research. In general, I’d say a good sign of a terrible model of consciousness would be that it validates all of our common-sense intuitions about the topic.
The falsifiable predictions are mostly claims that the computational functions will be (imperfectly) correlated with symmetry, but the treatment of boredom appears to allow that these will be quite imperfect:
Section XI is intended as the core set of falsifiable predictions—you may be thinking of the ‘implications for neuroscience’ discussion in Section XII, some of which could be extended to become falsifiable predictions.
Overall, this seems systematically analogous to IIT in its flaws. If one wanted to pursue an analogy to Aaronson’s discussion of trivial expander graphs producing extreme super-consciousness, one could create an RL agent (perhaps in an artificial environment where it has the power to smile, seek out rewards, avoid injuries (which trigger negative reward), favor injured limbs, and consume painkillers (which stop injuries from generating negative reward)) whose symmetry could be measured in whatever way the author would like to specify.
I think we can say now that we could program the agent in such a way that it sought out things that resulted in either more or less symmetric states, or was neutral to such things. Likewise, switching the signs of rewards would not reliably switch the associated symmetry. And its symmetry could be directly and greatly altered without systematic matching behavioral changes.
I would like to know whether the theory in PQ is supposed to predict that such agents couldn’t be built without extraordinary efforts, or that they would have systematic mismatch of their functional beliefs and behavior regarding qualia with actual qualia.
I’d assert - very strongly - that one could not evolve such a suffering-seeking agent without extraordinary effort, and that if one were to attempt to build one from scratch, it would be orders of magnitude more difficult to do so than making a “normal” agent. (This follows from my reasoning in Section XII.) But let’s keep in mind that whether the agent you’re speaking of is a computational program or a physical system matters a lot—under my model, an RL agent running on a standard von Neumann physical architecture probably has small & merely fragmentary qualia.
An analogy here would be the orthogonality thesis: perhaps we can call this “valence orthogonality”: the behavior of a system, and its valence, are usually tightly linked via evolutionary processes and optimization factors, but they are not directly causally coupled, just as intelligence & goals are not causally coupled.
This hypothesis does also have implications for the qualia of whole-brain emulations, which perhaps is closer to your thought-experiment.
As I understand their position, MIRI tends to not like IIT because it’s insufficiently functionalist—and too physicalist. On the other hand, I don’t think IIT could be correct because it’s too functionalist—and insufficiently physicalist, partially for the reasons I explain in my response to Jessica.
The core approach I’ve taken is to enumerate the sorts of problems one would need to solve if one were to formalize consciousness. (Whether consciousness is a thing-that-can-be-formalized is another question, of course.) My analysis is that IIT satisfactorily addresses 4 or 5 of the 8 problems. Moving to a more physical basis would address more of these problems, though not all (a big topic in PQ is how to interpret IIT-like output, which is a task independent of how to generate it).
Other research along these same lines would be e.g.,
We can say that a high-level phenomenon is strongly emergent with respect to a low-level domain when the high-level phenomenon arises from the low-level domain, but truths concerning that phenomenon are not deducible even in principle from truths in the low-level domain.
Suppose we have a Python program running on a computer. Truths about the Python program are, in some sense, reducible to physics; however the reduction itself requires resolving philosophical questions. I don’t know if this means the Python program’s functioning (e.g. values of different variables at different times) are “strongly emergent”; it doesn’t seem like an important question to me.
Downward causation means that higher-level phenomena are not only irreducible but also exert a causal efficacy of some sort. … [This implies] low-level laws will be incomplete as a guide to both the low-level and the high-level evolution of processes in the world.
In the case of the Python program this seems clearly false (it’s consistent to view the system as a physical system without reference to the Python program). I expect this is also false in the case of consciousness. I think almost all computationalists would strongly reject downwards causation according to this definition. Do you know of any computationalists who actually advocate downwards causation (i.e. that you can’t predict future physical states by just looking at past physical states without thinking about the higher levels)?
IMO consciousness has power over physics the same way the Python program has power over physics; we can consider counterfactuals like “what if this variable in the Python program magically had a different value” and ask what would happen to physics if this happened (in this case, maybe the variable controls something displayed on a computer screen, so if the variable were changed then the computer screen would emit different light). Actually formalizing questions like “what would happen if this variable had a different value” requires a theory of logical counterfactuals (which MIRI is researching, see this paper).
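A toy version of that counterfactual query (hypothetical code; the hard part MIRI’s work addresses is making such queries well-defined when the variable’s value is logically fixed, which naive re-running glosses over):

```python
def run_program(brightness):
    # Pretend `brightness` is the variable inside the Python program, and the
    # returned string is the light the screen would emit.
    return "#" * brightness

actual = run_program(5)
counterfactual = run_program(9)  # "what if this variable had a different value?"
print(actual)          # #####
print(counterfactual)  # #########  <- the downstream "physics" differs
```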
Notably, Python programs usually don’t “make choices” such that “control” is all that meaningful, but humans do. Here I would say that humans implement a decision theory, while most Python programs do not (although some Python programs do implement a decision theory and can be meaningfully said to “make choices”). “Implementing a decision theory” means something like “evaluating multiple actions based on what their effects are expected to be, and choosing the one that scores best according to some metric”; some AI systems like reinforcement learners implement a decision theory.
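A minimal sketch of “implementing a decision theory” in this sense (the world model, actions, and utilities below are made-up placeholders):

```python
def choose(actions, model, utility):
    # Evaluate each action by the expected utility of its predicted outcomes,
    # then pick the one that scores best.
    def expected_utility(action):
        return sum(p * utility(outcome) for outcome, p in model[action].items())
    return max(actions, key=expected_utility)

# Hypothetical world model: action -> probability distribution over outcomes.
model = {"press_button": {"door_open": 0.9, "door_shut": 0.1},
         "wait":         {"door_open": 0.1, "door_shut": 0.9}}
utility = lambda outcome: 1.0 if outcome == "door_open" else 0.0

print(choose(["press_button", "wait"], model, utility))  # -> press_button
```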
(I’m writing this comment to express more “computationalism has a reasonable steelman that isn’t identified as a possible position in PQ” rather than “computationalism is clearly right”)
Thus, we would need to be open to the possibility that certain interventions could cause a change in a system’s physical substrate (which generates its qualia) without causing a change in its computational level (which generates its qualia reports)
It seems like this means that empirical tests (e.g. neuroscience stuff) aren’t going to help test aspects of the theory that are about divergence between computational pseudo-qualia (the things people report on) and actual qualia. If I squint a lot I could see “anthropic evidence” being used to distinguish between pseudo-qualia and qualia, but it seems like nothing else would work.
I’m also not sure why we would expect pseudo-qualia to have any correlation with actual qualia? I guess you could make an anthropic argument (we’re viewing the world from the perspective of actual qualia, and our sensations seem to match the pseudo-qualia). That would give someone the suspicion that there’s some causal story for why they would be synchronized, without directly providing such a causal story.
(For the record I think anthropic reasoning is usually confused and should be replaced with decision-theoretic reasoning (e.g. see this discussion), but this seems like a topic for another day)
Yes, the epistemological challenges with distinguishing between ground-truth qualia and qualia reports are worrying. However, I don’t think they’re completely intractable, because there is a causal chain (from Appendix C):
Our brain’s physical microstates (perfectly correlated with qualia) -->
The logical states of our brain’s self-model (systematically correlated with our brain’s physical microstates) -->
Our reports about our qualia (systematically correlated with our brain’s model of its internal state)
.. but there could be substantial blindspots, especially in contexts where there was no adaptive benefit to having accurate systematic correlations.
Awesome, I do like your steelman. More thoughts later, but just wanted to share one notion before sleep:
With regard to computationalism, I think you’ve nailed it. Downward causation seems pretty obviously wrong (and I don’t know of any computationalists that personally endorse it).
IMO consciousness has power over physics the same way the Python program has power over physics
Totally agreed, and I like this example.
it’s consistent to view the system as a physical system without reference to the Python program
Right - but I would go even further. Namely, given any non-trivial physical system, there exist multiple equally valid interpretations of what’s going on at the computational level. The example I give in PQ is: let’s say I shake a bag of popcorn. With the right mapping, we could argue that we could treat that physical system as simulating the brain of a sleepy cat. However, given another mapping, we could treat that physical system as simulating the suffering of five holocausts. Very worryingly, we have no principled way to choose between these interpretive mappings. Am I causing suffering by shaking that bag of popcorn?
And I think all computation is like this, if we look closely - there exists no frame-invariant way to map between computations and physical systems in a principled way… just useful mappings, and non-useful mappings (and ‘useful’ is very frame-dependent).
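A toy sketch of why an unconstrained interpretive mapping is so cheap to construct (the state labels are placeholders, not a claim about what a real mapping would look like):

```python
# Any run of distinct "physical" micro-states can be paired, by fiat, with the
# state sequence of whatever computation we like; the lookup table does all the
# work, and nothing about the popcorn itself constrains the choice.
popcorn_trace = ["p0", "p1", "p2", "p3"]              # micro-states over time
cat_trace = ["purr", "stretch", "doze", "dream"]      # "sleepy cat" program states

as_cat = dict(zip(popcorn_trace, cat_trace))
print([as_cat[s] for s in popcorn_trace])             # the same shake, read as a cat

# An equally unconstrained table reads the very same trace as a different computation.
as_other = dict(zip(popcorn_trace, ["s0", "s1", "s2", "s3"]))
print([as_other[s] for s in popcorn_trace])
```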
This introduces an inconsistency into computationalism, and has some weird implications: I suspect that, given any computational definition of moral value, there would be a way to prove any arbitrary physical system morally superior to any other arbitrary physical system. I.e., you could prove both that A>B, and B>A.
… I may be getting something wrong here. But it seems like the lack of a clean quarks-to-bits mapping ultimately turns out to be a big deal, and is a big reason why I advocate not trying to define moral value in terms of Turing machines & bitstrings.
Instead, I tend to think of ethics as “how should we arrange the [quarks|negentropy] in our light-cone?”—ultimately we live in a world of quarks, so ethics is a question of quarks (or strings, or whatnot).
However! Perhaps this is just a failure of my imagination. What is ethics if not how to arrange our physical world? Or can you help me steelman computationalism against this inconsistency objection?
Thanks again for the comments. They’re both great and helpful.
Thanks for your comments too, I’m finding them helpful for understanding other possible positions on ethics.
With the right mapping, we could argue that we could treat that physical system as simulating the brain of a sleepy cat. However, given another mapping, we could treat that physical system as simulating the suffering of five holocausts. Very worryingly, we have no principled way to choose between these interpretive mappings.
OK, how about a rule like this:
Physical system P embeds computation C if and only if P has different behavior counterfactually on C taking on different values
(formalizing this rule would require a theory of logical counterfactuals; I’m not sure if I expect a fully general theory to exist but it seems plausible that one does)
I’m not asserting that this rule is correct but it doesn’t seem inconsistent. In particular it doesn’t seem like you could use it to prove A > B and B > A. And clearly your popcorn embeds neither a cat nor the suffering of five holocausts under this rule.
If it turns out that no simple rule of this form works, I wouldn’t be too troubled, though; I’d be psychologically prepared to accept that there isn’t a clean quarks-to-computations mapping. Similar to how I already accept that human value is complex, I could accept that human judgments of “does this physical system implement this computation” are complex (and thus can’t be captured in a simple rule). I don’t think this would make me inconsistent, I think it would just make me more tolerant of nebulosity in ethics. At the moment it seems like clean mappings might exist and so it makes sense to search for them, though.
Instead, I tend to think of ethics as “how should we arrange the [quarks|negentropy] in our light-cone?”—ultimately we live in a world of quarks, so ethics is a question of quarks (or strings, or whatnot).
On the object level, it seems like it’s possible to think of painting as “how should we arrange the brush strokes on the canvas?”. But it seems hard to paint well while only thinking at the level of brush strokes (and not thinking about the higher levels, like objects). I expect ethics to be similar; at the very least if human ethics has an “aesthetic” component then it seems like designing a good light cone is at least as hard as making a good painting. Maybe this is a strawman of your position?
On the meta level, I would caution against this use of “ultimately”; see here and here (the articles are worded somewhat disagreeably but I mostly endorse the content). In some sense ethics is about quarks, but in other senses it’s about:
I think these are all useful ways of viewing ethics, and I don’t feel the need to pick a single view (although I often find it appealing to look at what some views say about what other views are saying and resolving the contradictions between them). There are all kinds of reasons why it might be psychologically uncomfortable not to have a simple theory of ethics (e.g. it’s harder to know whether you’re being ethical, it’s harder to criticize others for being unethical, it’s harder for groups to coordinate around more complex and ambiguous ethical theories, you’ll never be able to “solve” ethics once and then never have to think about ethics again, it requires holding multiple contradictory views in your head at once, you won’t always have a satisfying verbal justification for why your actions are ethical). But none of this implies that it’s good (in any of the senses above!) to assume there’s a simple ethical theory.
(For the record I think it’s useful to search for simple ethical theories even if they don’t exist, since you might discover interesting new ways of viewing ethics, even if these views aren’t complete).
Physical system P embeds computation C if and only if P has different behavior counterfactually on C taking on different values
I suspect this still runs into the same problem—in the case of the computational-physical mapping, even if we assert that C has changed, we can merely choose a different interpretation of P which is consistent with the change, without actually changing P.
If it turns out that no simple rule of this form works, I wouldn’t be too troubled, though; I’d be psychologically prepared to accept that there isn’t a clean quarks-to-computations mapping. Similar to how I already accept that human value is complex, I could accept that human judgments of “does this physical system implement this computation” are complex (and thus can’t be captured in a simple rule)
This is an important question: if there exists no clean quarks-to-computations mapping, is it (a) a relatively trivial problem, or (b) a really enormous problem? I’d say the answer to this depends on how we talk about computations. I.e., if we say “the ethically-relevant stuff happens at the computational level”—e.g., we shouldn’t compute certain strings—then I think it grows to be a large problem. This grows particularly large if we’re discussing how to optimize the universe! :)
I think these are all useful ways of viewing ethics, and I don’t feel the need to pick a single view (although I often find it appealing to look at what some views say about what other views are saying and resolving the contradictions between them). There are all kinds of reasons why it might be psychologically uncomfortable not to have a simple theory of ethics (e.g. it’s harder to know whether you’re being ethical, it’s harder to criticize others for being unethical, it’s harder for groups to coordinate around more complex and ambiguous ethical theories, you’ll never be able to “solve” ethics once and then never have to think about ethics again, it requires holding multiple contradictory views in your head at once, you won’t always have a satisfying verbal justification for why your actions are ethical). But none of this implies that it’s good (in any of the senses above!) to assume there’s a simple ethical theory.
Let me push back a little here- imagine we live in the early 1800s, and Faraday was attempting to formalize electromagnetism. We had plenty of intuitive rules of thumb for how electromagnetism worked, but no consistent, overarching theory. I’m sure lots of people shook their head and said things like, “these things are just God’s will, there’s no pattern to be found.” However, it turns out that there was something unifying to be found, and tolerance of inconsistencies & nebulosity would have been counter-productive.
Today, we have intuitive rules of thumb for how we think consciousness & ethics work, but similarly no consistent, overarching theory. Are consciousness & moral value like electromagnetism—things that we can discover knowledge about? Or are they like elan vital—reifications of clusters of phenomena that don’t always cluster cleanly?
I think the jury’s still out here, but the key with electromagnetism was that Faraday was able to generate novel, falsifiable predictions with his theory. I’m not claiming to be Faraday, but I think if we can generate novel, falsifiable predictions with work on consciousness & valence (I offer some in Section XI, and observations that could be adapted to make falsifiable predictions in Section XII), this should drive updates toward “there’s some undiscovered cache of predictive utility here, similar to what Faraday found with electromagnetism.”
I suspect this still runs into the same problem—in the case of the computational-physical mapping, even if we assert that C has changed, we can merely choose a different interpretation of P which is consistent with the change, without actually changing P.
It seems like you’re saying here that there won’t be clean rules for determining logical counterfactuals? I agree this might be the case but it doesn’t seem clear to me. Logical counterfactuals seem pretty confusing and there seems to be a lot of room for better theories about them.
This is an important question: if there exists no clean quarks-to-computations mapping, is it (a) a relatively trivial problem, or (b) a really enormous problem? I’d say the answer to this depends on how we talk about computations. I.e., if we say “the ethically-relevant stuff happens at the computational level”—e.g., we shouldn’t compute certain strings—then I think it grows to be a large problem. This grows particularly large if we’re discussing how to optimize the universe! :)
I agree that it would be a large problem. The total amount of effort to “complete” the project of figuring out which computations we care about would be practically infinite, but with a lot of effort we’d get better and better approximations over time, and we would be able to capture a lot of moral value this way.
Let me push back a little here
I mostly agree with your push back; I think when we have different useful views of the same thing that’s a good indication that there’s more intellectual progress to be made in resolving the contradictions between the different views (e.g. by finding a unifying theory).
I think we have a lot more theoretical progress to make on understanding consciousness and ethics. On priors I’d expect the theoretical progress to produce more-satisfying things over time without ever producing a complete answer to ethics. Though of course I could be wrong here; it seems like intuitions vary a lot. It seems more likely to me that we find a simple unifying theory for consciousness than ethics.
It seems like you’re saying here that there won’t be clean rules for determining logical counterfactuals? I agree this might be the case but it doesn’t seem clear to me. Logical counterfactuals seem pretty confusing and there seems to be a lot of room for better theories about them.
Right, and I would argue that logical counterfactuals (in the way we’ve mentioned them in this thread) will necessarily be intractably confusing, because they’re impossible to do cleanly. I say this because in the “P & C” example above, we need a frame-invariant way to interpret a change in C in terms of P. However, we can only have such a frame-invariant way if there exists a clean mapping (injection, surjection, bijection, etc.) between P & C - which I think we can’t have, even theoretically.
(Unless we define both physics and computation through something like constructor theory… at which point we’re not really talking about Turing machines as we know them—we’d be talking about physics by another name.)
This is a big part of the reason why I’m a big advocate of trying to define moral value in physical terms: if we start with physics, then we know our conclusions will ‘compile’ to physics. If instead we start with the notion that ‘some computations have more moral value than others’, we’re stuck with the problem—intractable problem, I argue—that we don’t have a frame-invariant way to precisely identify what computations are happening in any physical system (and likewise, which aren’t happening). I.e., statements about computations will never cleanly compile to physical terms. And whenever we have multiple incompatible interpretations, we necessarily get inconsistencies, and we can prove anything is true (i.e., we can prove any arbitrary physical system is superior to any other).
Does that argument make sense?
… that said, it would seem very valuable to make a survey of possible levels of abstraction at which one could attempt to define moral value, and their positives & negatives.
However, we can only have such a frame-invariant way if there exists a clean mapping (injection, surjection, bijection, etc) between P&C- which I think we can’t have, even theoretically.
I’m still not sure why you strongly think there’s _no_ principled way; it seems hard to prove a negative. I mentioned that we could make progress on logical counterfactuals; there’s also the approach Chalmers talks about here. (I buy that there’s reason to suspect there’s no principled way if you’re not impressed by any proposal so far).
And whenever we have multiple incompatible interpretations, we necessarily get inconsistencies, and we can prove anything is true (i.e., we can prove any arbitrary physical system is superior to any other).
I don’t think this follows. The universal prior is not objective; you can “prove” that any bit probably follows from a given sequence, by changing your reference machine. But I don’t think this is too problematic. We just accept that some things don’t have a super clean objective answer. The reference machines that make odd predictions (e.g. that 000000000 is probably followed by 1) look weird, although it’s hard to precisely say what’s weird about them without making reference to another reference machine. I don’t think this kind of non-objectivity implies any kind of inconsistency.
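For reference, the universal (Solomonoff) prior being alluded to is usually written, for a fixed universal machine $U$, as

$$M_U(x) \;=\; \sum_{p \,:\, U(p) \text{ starts with } x} 2^{-|p|},$$

where the sum ranges over programs whose output begins with $x$. Switching to a different reference machine changes these probabilities only up to a multiplicative constant (the invariance theorem), but that constant can be large enough to flip which continuation of any particular finite string looks more likely.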
Similarly, even if objective approaches to computational interpretations fail, we could get a state where computational interpretations are non-objective (e.g. defined relative to a “reference machine”) and the reference machines that make very weird predictions (like the popcorn implementing a cat) would look super weird to humans. This doesn’t seem like a fatal flaw to me, for the same reason it’s not a fatal flaw in the case of the universal prior.
What you’re saying seems very reasonable; I don’t think we differ on any facts, but we do have some divergent intuitions on implications.
I suspect this question—whether it’s possible to offer a computational description of moral value that could cleanly ‘compile’ to physics—would have non-trivial yet also fairly modest implications for most of MIRI’s current work.
I would expect the significance of this question to go up over time, both in terms of direct work MIRI expects to do, and in terms of MIRI’s ability to strategically collaborate with other organizations. I.e., when things shift from “let’s build alignable AGI” to “let’s align the AGI”, it would be very good to have some of this metaphysical fog cleared away so that people could get on the same ethical page, and see that they are in fact on the same page.
Thanks for the response; I’ve found this discussion useful for clarifying and updating my views.
However, when we start talking about mind simulations and ‘thought crime’, WBE, selfish replicators, and other sorts of tradeoffs where there might be unknown unknowns with respect to moral value, it seems clear to me that these issues will rapidly become much more pressing. So, I absolutely believe work on these topics is important, and quite possibly a matter of survival. (And I think it’s tractable, based on work already done.)
Suppose we live under the wrong moral theory for 100 years. Then we figure out the right moral theory, and live according to that one for the rest of time. How much value is lost in that 100 years? It seems very high but not an x-risk. It seems like we only get x-risks if somehow we don’t put a trusted reflection process (e.g. human moral philosophers) in control of the far future.
It seems quite sensible for people who don’t put overwhelming importance on the far future to care about resolving moral uncertainty earlier. The part of my morality that isn’t exclusively concerned with the far future strongly approves of things like consciousness research that resolve moral uncertainty earlier.
Based on my understanding, I don’t think Act-based agents or Task AI would help resolve these questions by default, although as tools they could probably help.
Act-based agents and task AGI kick the problem of global governance to humans. Humans still need to decide questions like how to run governments; they’ll be able to use AGI to help them, but governing well is still a hard problem even with AI assistance. The goal would be that moral errors are temporary; with the right global government structure, moral philosophers will be able to make moral progress and have their moral updates reflected in how things play out.
It’s possible that you think that governing the world well enough that the future eventually reflects human values is very hard even with AGI assistance, and would be made easier with better moral theories made available early on.
One factor that bears mentioning is whether an AGI’s ontology & theory of ethics might be path-dependent upon its creators’ metaphysics in such a way that it would be difficult for it to update if it’s wrong. If this is a plausible concern, this would imply a time-sensitive factor in resolving the philosophical confusion around consciousness, valence, moral value, etc.
I agree with this but place low probability on the antecedent. It’s kind of hard to explain briefly; I’ll point to this comment thread for a good discussion (I mostly agree with Paul).
But now that I think about it more, I don’t put super low probability on the antecedent. It seems like it would be useful to have some way to compare different universes that we’ve failed to put in control of trusted reflection processes, to e.g. get ones that have less horrific suffering or more happiness. I place high probability on “distinguishing between such universes is as hard as solving the AI alignment problem in general”, but I’m not extremely confident of that and I don’t have a super precise argument for it. I guess I wouldn’t personally prioritize such research compared to generic AI safety research but it doesn’t seem totally implausible that resolving moral uncertainty earlier would reduce x-risk for this reason.
I generally agree with this—getting it right eventually is the most important thing; getting it wrong for 100 years could be horrific, but not an x-risk.
I do worry some that “trusted reflection process” is a sufficiently high-level abstraction so as to be difficult to critique.
Interesting piece by Christiano, thanks! I would also point to a remark I made above, that doing this sort of ethical clarification now (if indeed it’s tractable) will pay dividends in aiding coordination between organizations such as MIRI, DeepMind, etc. Or rather, not clarifying goals, consciousness, moral value, etc. seems likely to increase the risks of racing to be the first to develop AGI, and of secrecy & distrust between organizations.
1. Clarifying “what should people who gain a huge amount of power through AI do with Earth, existing social structures, and the universe?” seems like a good question to get agreement on for coordination reasons.
2. We should be looking for tractable ways of answering this question.
I think:
a) consciousness research will fail to clarify ethics enough to answer enough of (1) to achieve coordination (since I think human preferences on the relevant timescales are way more complicated than consciousness, conditioned on consciousness being simple).
b) it is tractable to answer (1) without reaching agreement on object-level values, by doing something like designing a temporary global government structure that most people agree is pretty good (in that it will allow society to reflect appropriately and determine the next global government structure), but that this question hasn’t been answered well yet and that a better answer would improve coordination. E.g. perhaps society is run as a global federalist democratic-ish structure with centralized control of potentially destructive technology (taking into account “how voters would judge something if they thought longer” rather than “how voters actually judge something”; this might be possible if the AI alignment problem is solved). It seems quite possible to create proposals of this form and critique them.
It seems like we disagree about (a) and this disagreement has been partially hashed out elsewhere, and that it’s not clear we have a strong disagreement about (b).
We would lose a great deal of value by optimizing the universe according to current moral uncertainty, without the opportunity to reflect and become less uncertain over time.
There’s a great deal of reflection necessary to figure out what actions moral theory X recommends, e.g. to figure out which minds exist or what implicit promises people have made to each other. I don’t see this reflection as distinct from reflection about moral uncertainty; if we’re going to defer to a reflection process anyway for making decisions, we might as well let that reflection process decide on issues of moral theory.
What if an AI exploring moral uncertainty finds that there is provably no correct moral theory or right moral facts? In that case, there is no moral uncertainty between moral theories, as they are all false. Could it escape this obstacle just by aggregating humans’ opinions about possible situations?
What if an AI exploring moral uncertainty finds that there is provably no correct moral theory or right moral facts?
In that case it would be exploring traditional metaethics, not moral uncertainty.
But if moral uncertainty is used as a solution then we just bake in some high level criteria for the appropriateness of a moral theory, and the credences will necessarily sum to 1. This is little different from baking in coherent extrapolated volition. In either case the agent is directly motivated to do whatever it is that satisfies our designated criteria, and it will still want to do it regardless of what it thinks about moral realism.
Those criteria might be very vague and philosophical, or they might be very specific and physical (like ‘would a simulation of Bertrand Russell say “a-ha, that’s a good theory”?’), but either way they will be specified.
This is enough to make me discount its value by perhaps one-to-two orders of magnitude.
So you’d put the probability of CEV working at between 90 and 99 percent? 90% seems plausible to me if a little high; 99% seems way too high.
But I have to give you a lot of credit for saying “the possibility of CEV discounts how valuable this is” instead of “this doesn’t matter because CEV will solve it”; many people say the latter, implicitly assuming that CEV has a near-100% probability of working.
So you’d put the probability of CEV working at between 90 and 99 percent?
No, rather lower than that (80%?). But I think that we’re more likely to attain only somewhat-flawed versions of the future without something CEV-ish. This reduces my estimate of the value of getting them kind of right, relative to getting good outcomes through worlds which do achieve something like CEV. I think that, ex post, this probably provides another very large discount factor, and the significant chance that it does provides another modest ex-ante discount factor (maybe another 80%; none of my numbers here are deeply considered).
Hey I (David Krueger) remember we spoke about this a bit with Toby when I was at FHI this summer.
I think we should be aiming for something like CEV, but we might not get it, and we should definitely consider scenarios where we have to settle for less.
For instance, some value-aligned group might find that its best option (due to competitive pressures) is to create an AI which has a 50% probability of being CEV-like or “aligned via corrigibility”, but has a 50% probability of (effectively) prematurely settling on a utility function whose goodness depends heavily on the nature of qualia.
If (as I believe) such a scenario is likely, then the problem is time-sensitive.
(effectively) prematurely settling on a utility function whose goodness depends heavily on the nature of qualia
This feels extremely unlikely; I don’t think we have plausible paths to obtaining a non-negligibly good outcome without retaining the ability to effectively deliberate about e.g. the nature of qualia. I also suspect that we will be able to solve the control problem, and if we can’t then it will be because of failure modes that can’t be avoided by settling on a utility function. Of course “can’t see any way it can happen” is not the same as “am justifiably confident it won’t happen,” but I think in this case it’s enough to get us to pretty extreme odds.
More precisely, I’d give 100:1 against: (a) we will fail to solve the control problem in a satisfying way, (b) we will fall back to a solution which depends on our current understanding of qualia, (c) the resulting outcome will be non-negligibly good according to our view about qualia at the time that we build AI, and (d) it will be good because we hold that view about qualia.
(My real beliefs might be higher than 1% just based on “I haven’t thought about it very long” and peer disagreement. But I think it’s more likely than not that I would accept a bet at 100:1 odds after deliberation, even given that reasonable people disagree.)
(By non-negligibly good I mean that we would be willing to make some material sacrifice to improve its probability compared to a barren universe, perhaps of $1000/1% increase. By because I mean that the outcome would have been non-negligibly worse according to that view if we had not held it.)
I’m not sure if there is any way to turn the disagreement into a bet. Perhaps picking an arbiter and looking at their views in a decade? (e.g. Toby, Carl Shulman, Wei Dai?) This would obviously involve less extreme odds.
Probably more interesting than betting is resolving the disagreement. This seems to be a slightly persistent disagreement between me and Toby, I have never managed to really understand his position but we haven’t talked about it much. I’m curious about what kind of solutions you see as plausible—it sounds like your view is based on a more detailed picture rather than an “anything might happen” view.
I think I was too terse; let me explain my model a bit more.
I think there’s a decent chance (OTTMH, let’s say 10%) that without any deliberate effort we make an AI which wipes out humanity, but is anyhow more ethically valuable than us (although not more than something which we deliberately design to be ethically valuable). This would happen, e.g., if this were the default outcome (e.g., if it turns out to be the case that intelligence ~ ethical value). This may actually be the most likely path to victory.*
There’s also some chance that all we need to do to ensure that AI has (some) ethical value (e.g. due to having qualia) is X. In that case, we might increase our chance of doing X by understanding qualia a bit better.
Finally, my point was that I can easily imagine a scenario in which our alternatives are:
Build an AI with 50% chance of being aligned, 50% chance of just being an AI (with P(AI has property X) = 90% if we understand qualia better, 10% else)
Allow our competitors to build an AI with ~0% chance of being ethically valuable.
So then we obviously prefer option 1, and if we understand qualia better, option 1 becomes better.
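To make that concrete with some made-up numbers (value 1 for the aligned outcome, $v > 0$ for an unaligned AI that has property X, 0 otherwise):

$$\mathrm{EV}(\text{option 1, qualia understood}) = 0.5 \cdot 1 + 0.5 \cdot 0.9\,v = 0.5 + 0.45\,v$$
$$\mathrm{EV}(\text{option 1, qualia not understood}) = 0.5 \cdot 1 + 0.5 \cdot 0.1\,v = 0.5 + 0.05\,v$$
$$\mathrm{EV}(\text{option 2}) \approx 0$$

So option 1 dominates either way, and understanding qualia adds $0.4\,v$ in expectation under these assumed numbers.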
* I notice as I type this that this may have some strange consequences RE high-level strategy; e.g. maybe it’s better to just make something intelligent ASAP and hope that it has ethical value, because this reduces its X-risk, and we might not be able to do much to change the distribution of the ethical value the AI we create produces anyhow. I tend to think that we should aim to be very confident that the AI we build is going to have lots of ethical value, but this may only make sense if we have a pretty good chance of succeeding.
Ah, that makes a lot more sense, sorry for misinterpreting you. (I think Toby has a view closer to the one I was responding to, though I suspect I am also oversimplifying his view.)
I agree that there are important philosophical questions that bear on the goodness of building various kinds of (unaligned) AI, and I think that those questions do have impact on what we ought to do. The biggest prize is if it turns out that some kinds of unaligned AI are much better than others, which I think is plausible. I guess we probably have similar views on these issues, modulo me being more optimistic about the prospects for aligned AI.
I don’t think that an understanding of qualia is an important input into this issue though.
For example, from a long-run ethical perspective, whether or not humans have qualia is not especially important, and what mostly matters is human preferences (since those are what shape the future). If you created a race of p-zombies that nevertheless shared our preferences about qualia, I think it would be fine. And “the character of human preferences” is a very different kind of object than qualia. These questions are related in various ways (e.g. our beliefs about qualia are related to our qualia and to philosophical arguments about consciousness), but after thinking about that a little bit I think it is unlikely that the interaction is very important.
To summarize, I do agree that there are time-sensitive ethical questions about the moral value of creating unaligned AI. This was item 1.2 in this list from 4 years ago. I could imagine concluding that the nature of qualia is an important input into this question, but don’t currently believe that.
Thanks for the write-up. I’m excited about people presenting well thought-through cases for the value of different domains.
I want to push back a bit against the claim that the problem is time-sensitive. If we needed to directly specify what we valued to a powerful AI, then it would be crucial that we had a good answer to that by the time we had such an AI. But an alternative to directly specifying what it is that we value is to specify the process for working out what to value (something in the direction of CEV). If we can do this, then we can pass the intellectual work of this research off to the hypothesised AI. And this strategy looks generally very desirable for various robustness reasons.
Putting this together, I think that there is a high probability that consciousness research is not time-critical. This is enough to make me discount its value by perhaps one-to-two orders of magnitude. However, it could remain high-value even given such a discount.
(I agree that in the long run it’s important. I haven’t looked into your work beyond this post, so I don’t (yet) have much of a direct view of how tractable the problem is to your approach. At least I don’t see problems in principle.)
Thanks for the comment! I think the time-sensitivity of this research is an important claim, as you say.
My impression of how MIRI currently views CEV is that it’s ‘a useful intuition pump, but not something we should currently plan to depend on for heavy lifting’. In the last MIRI AMA, Rob noted that
As an intuition pump, rough sketch, or placeholder, I really like CEV. What I’m worried about is that discussion of CEV generally happens in “far mode”, and there’s very probably work that could and should be done now in order to evaluate how plausible CEV is, and explore alternatives. Four reasons not to depend too much on CEV alone:
CEV is really hard. This seems consistent with what Rob & Eliezer have said.
CEV may not be plausible. A failure mode acknowledged in the original document is that preferences may never cohere—but I would add that CEV may simply be too underdefined & ambiguous to be useful in many cases. E.g., a “preference” is a rather leaky abstraction sometimes to begin with. A lot of possibilities look reasonable from far away, but not from up close, and CEV might be one of these.
CEV may give bad answers. It seems entirely possible that any specific implementation of CEV would unavoidably include certain undesirable systemic biases. More troublingly, maybe preference utilitarianism is just a bad way to go about ethics (I think this is true, personally).
Research into qualia may help us get CEV right. If we define the landscape of consciousness as the landscape within which morally-significant things happen, then understanding this landscape better should help us see how CEV could- or couldn’t- help us navigate it.
Aside from these CEV-specific concerns, I think research into consciousness & valence could have larger benefits to AI safety- I wrote up some thoughts on this last year at http://opentheory.net/2015/09/fai_and_valence/ .
Rather than time-sensitivity, another way to frame this could be path-dependence based on order of technological development. Do we get better average & median futures if we attempt to build AI without worrying much about qualia, or if we work on both at once?
(Granted, even if this research is all I say it is, there are potential pitfalls of technological development down this path.)
Some thoughts:
IMO the most plausible non-CEV proposals are:
Act-based agents, which defer to humans to a large extent. The goal is to keep humans in control of the future.
Task AI, which is used to accomplish concrete objectives in the world. The idea would be to use this to accomplish goals people would want accomplished using AI (including reducing existential risk), while leaving the future moral trajectory in the hands of humans.
Both proposals end up deferring to humans to decide the long-run trajectory of humanity. IMO, this isn’t a coincidence; I don’t think it’s likely that we get a good outcome without deferring to humans in the long run.
Some more specific comments:
There’s one story where this makes a little bit of sense: we basically give up on satisfying any human values other than hedonic ones, and build an AI that maximizes pleasure alone. I’m skeptical that this is any easier than solving the full value alignment problem, but even if it were, I think it would be undesirable to the vast majority of humans, and so we would collectively be better off coordinating around a higher target.
If we’re shooting for a higher target, then we have some story for why we get more values than just hedonic values. E.g. the AI defers to human moral philosophers on some issues. But this method should also succeed for loading hedonic values. So there isn’t a significant benefit to having hedonic values specified ahead of time.
This seems to be in the same reference class as asking questions like “how many humans exist” or “what’s the closing price of the Dow Jones”. I.e. you can use it to check if things are going as expected, though the metric can be manipulated. Personally I’m pessimistic about such sanity checks in general, and even if I were optimistic about them, I would think that the marginal value of one additional sanity check is low.
See Eliezer’s thoughts on mindcrime. Also see the discussion in the comments. It does seem like consciousness research could help for defining a nonpersonhood predicate.
I don’t have comments on cognitive enhancement since it’s not my specialty.
Some of the points (6, 7, 8) seem most relevant if we expect AGI to be designed to use internal reinforcement substantially similar to humans’ internal reinforcement and substantially different from modern reinforcement learning. I don’t have precise enough models of such AGI systems to feel optimistic about doing research related to them, but if you think questions like “how would we incentivize neuromorphic AI systems to do what we want” are tractable then maybe it makes sense for you to do research on this question. I’m pessimistic about things in the reference class of IIT making any progress on this question, but maybe you have different models here.
I agree that “Valence research could change the social and political landscape AGI research occurs in” and, like you, I think the sign is unclear.
(I am a MIRI research fellow but am currently speaking for myself, not my employer.)
Hi Jessica,
Thanks for the thoughtful note. I do want to be very clear that I’m not criticizing MIRI’s work on CEV, which I do like very much! It seems like the best intuition pump & Schelling point in its area, and I think it has potential to be more.
My core offering in this space (where I expect most of the value to be) is Principia Qualia- it’s more up-to-date and comprehensive than the blog post you’re referencing. I pose some hypotheticals in the blog post, but it isn’t intended to stand alone as a substantive work (whereas PQ is).
But I had some thoughts in response to your response on valence + AI safety:
->1. First, I agree that leaving our future moral trajectory in the hands of humans is a great thing. I’m definitely not advocating anything else.
->2. But I would push back on whether our current ethical theories are very good- i.e., good enough to see us through any future AGI transition without needlessly risking substantial amounts of value.
To give one example: currently, some people make the claim that animals such as cows are much more capable of suffering than humans, because they don’t have much intellect to blunt their raw, emotional feeling. Other people make the claim that cows are much less capable of suffering than humans, because they don’t have the ‘bootstrapping strange loop’ mind architecture enabled by language, and necessary for consciousness. Worryingly, both of these arguments seem plausible, with no good way to pick between them.
Now, I don’t think cows are in a strange quantum superposition of both suffering and not suffering— I think there’s a fact of the matter, though we clearly don’t know it.
This example may have moral implications, but little relevance to existential risk. However, when we start talking about mind simulations and ‘thought crime’, WBE, selfish replicators, and other sorts of tradeoffs where there might be unknown unknowns with respect to moral value, it seems clear to me that these issues will rapidly become much more pressing. So, I absolutely believe work on these topics is important, and quite possibly a matter of survival. (And I think it’s tractable, based on work already done.)
Based on my understanding, I don’t think Act-based agents or Task AI would help resolve these questions by default, although as tools they could probably help.
->3. I also think theories in IIT’s reference class won’t be correct, but I suspect I define the reference class much differently. :) Based on my categorization, I would object to lumping my theory into IIT’s reference class (we could talk more about this if you’d like).
->4. Re: suffering computations- a big, interesting question here is whether moral value should be defined at the physical or computational level. I.e., “is moral value made out of quarks or bits (or something else)?” — this may be the crux of our disagreement, since I’m a physicalist and I gather you’re a computationalist. But PQ’s framework allows for bits to be “where the magic happens”, as long as certain conditions obtain.
One factor that bears mentioning is whether an AGI’s ontology & theory of ethics might be path-dependent upon its creators’ metaphysics in such a way that it would be difficult for it to update if it’s wrong. If this is a plausible concern, this would imply a time-sensitive factor in resolving the philosophical confusion around consciousness, valence, moral value, etc.
->5. I wouldn’t advocate strictly hedonic values (this was ambiguous in the blog post but is clearer in Principia Qualia).
->6. However, I do think that “how much horrific suffering is there in possible world X?” is a hands-down, qualitatively better proxy for whether it’s a desirable future than “what is the Dow Jones closing price in possible world X?”
->7. Re: neuromorphic AIs: I think an interesting angle here is, “how does boredom stop humans from wireheading on pleasurable stimuli?”—I view boredom as a sophisticated anti-wireheading technology. It seems possible (although I can’t vouch for plausible yet) that if we understand the precise mechanism by which boredom is implemented in human brains, it may help us understand and/or control neuromorphic AGIs better. But this is very speculative, and undeveloped.
I’m curious about this, since you mentioned fixing IIT’s flaws. I came to the comments to make the same complaint you were responding to Jessica about.
I had the same response. The document claims that pleasure or positive valence corresponds to symmetry.
This does not look like a metric that is tightly connected to sensory, cognitive, or behavioral features. In particular, it is not specifically connected to liking, wanting, aversion, and so forth. So, like IIT in the cases discussed by Scott Aaronson, it would seem likely to assign huge values (of valence rather than consciousness, in this case) to systems that lack the corresponding functions, and very low values to systems that possess them.
The document is explicit about qualia not being strictly linked to the computational and behavioral functions that lead us to, e.g. talk about qualia or withdraw from painful stimuli:
The falsifiable predictions are mostly claims that the computational functions will be (imperfectly) correlated with symmetry, but the treatment of boredom appears to allow that these will be quite imperfect:
Overall, this seems systematically analogous to IIT in its flaws. If one wanted to pursue an analogy to Aaronson’s discussion of trivial expander graphs producing extreme super-consciousness, one could create an RL agent (perhaps in an artificial environment where it has the power to smile, seek out rewards, avoid injuries (which trigger negative reward), favor injured limbs, and consume painkillers (which stop injuries from generating negative reward)) whose symmetry could be measured in whatever way the author would like to specify.
I think we can say now that we could program the agent in such a way that it sought out things that resulted in either more or less symmetric states, or was neutral to such things. Likewise, switching the signs of rewards would not reliably switch the associated symmetry. And its symmetry could be directly and greatly altered without systematic matching behavioral changes.
I would like to know whether the theory in PQ is supposed to predict that such agents couldn’t be built without extraordinary efforts, or that they would have systematic mismatch of their functional beliefs and behavior regarding qualia with actual qualia.
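To make the decoupling claim above concrete, here is a toy sketch (my own illustration, not something from the discussion): a crude stand-in for “symmetry” computed from an agent’s internal state vector, alongside a reward signal whose sign can be flipped without touching that state at all. Everything here—the mirror-similarity metric, the payoffs—is an assumption chosen purely for illustration.

```python
import numpy as np

def symmetry_score(state: np.ndarray) -> float:
    """Crude stand-in for 'symmetry': cosine similarity between the state and its mirror image."""
    mirrored = state[::-1]
    return float(np.dot(state, mirrored) / (np.linalg.norm(state) * np.linalg.norm(mirrored)))

def make_state(symmetric: bool, n: int = 8) -> np.ndarray:
    """Hypothetical internal state; palindromic (highly symmetric) or unstructured."""
    rng = np.random.default_rng(0)
    half = rng.normal(size=n // 2)
    return np.concatenate([half, half[::-1]]) if symmetric else rng.normal(size=n)

def reward(injured: bool, sign: int) -> float:
    """Reward for injury avoidance; flipping `sign` flips the reward without changing the state."""
    return sign * (-1.0 if injured else 1.0)

for symmetric in (True, False):
    state = make_state(symmetric)
    for sign in (+1, -1):
        print(f"symmetric={symmetric} sign={sign:+d} "
              f"symmetry={symmetry_score(state):.2f} reward={reward(True, sign):+.1f}")
```

As written, the symmetry score and the reward structure are set by independent design choices, which is the mismatch the comment asks the theory to either rule out or explain.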
Hi Carl, thanks for your thoughts & time. I appreciate the comments.
First, to be clear, the hypothesis is that the symmetry of the mathematical object isomorphic to a conscious experience corresponds to valence. This is distinct from (although related to) the symmetry of a stimulus, or even symmetry within brain networks.
I strongly disagree with this in the case of humans, fairly strongly disagree in the more general case of evolved systems, and mildly disagree in the fully general case of arbitrary systems.
First, it seems extremely likely to me that evolved organisms would use symmetry as an organizational principle / attractor (Section XII);
Second, in cases where we do have some relevant data or plausible models (I.e., as noted in Sections IX and XII), the symmetry hypothesis seems plausible. I think the hypothesis does really well when one actually looks at the object-level, particularly e.g., Safron’s model of orgasm & Seth and Friston’s model of interoception;
Third, with respect to extending Aaronson’s critique, I question whether “this seems to give weird results when put in novel contexts” is a good path to take. As Eric Schwitzgebel notes, “Common sense is incoherent in matters of metaphysics. There’s no way to develop an ambitious, broad-ranging, self-consistent metaphysical system without doing serious violence to common sense somewhere. It’s just impossible. Since common sense is an inconsistent system, you can’t respect it all. Every metaphysician will have to violate it somewhere.” This seems particularly true in the realm of consciousness, and particularly true in contexts where there was no evolutionary benefit in having correct intuitions.
As such it seems important not to enshrine common sense, with all its inconsistencies, as the gold standard with regard to valence research. In general, I’d say a good sign of a terrible model of consciousness would be that it validates all of our common-sense intuitions about the topic.
Section XI is intended as the core set of falsifiable predictions—you may be thinking of the ‘implications for neuroscience’ discussion in Section XII, some of which could be extended to become falsifiable predictions.
I’d assert- very strongly- that one could not evolve such a suffering-seeking agent without extraordinary effort, and that if one were to attempt to build one from scratch, it would be orders of magnitude more difficult to do so than making a “normal” agent. (This follows from my reasoning in Section XII.) But let’s keep in mind that whether the agent you’re speaking of is a computational program or a physical system matters a lot—under my model, an RL agent running on a standard Von Neumann physical architecture probably has small & merely fragmentary qualia.
An analogy here would be the orthogonality thesis; perhaps we can call this “valence orthogonality”: the behavior of a system and its valence are usually tightly linked via evolutionary processes and optimization factors, but they are not directly causally coupled, just as intelligence & goals are not causally coupled.
This hypothesis does also have implications for the qualia of whole-brain emulations, which perhaps is closer to your thought-experiment.
As I understand their position, MIRI tends to not like IIT because it’s insufficiently functionalist—and too physicalist. On the other hand, I don’t think IIT could be correct because it’s too functionalist—and insufficiently physicalist, partially for the reasons I explain in my response to Jessica.
The core approach I’ve taken is to enumerate the sorts of problems one would need to solve if one was to formalize consciousness. (Whether consciousness is a thing-that-can-be-formalized is another question, of course.) My analysis is that IIT satisfactorily addresses 4 or 5 of the 8 problems. Moving to a more physical basis would address more of these problems, though not all (a big topic in PQ is how to interpret IIT-like output, which is a task independent of how to generate it).
Other research along these same lines would be e.g.,
->Adam Barrett’s FIIH: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3912322/
->Max Tegmark’s Perceptronium: https://arxiv.org/abs/1401.1219
Some more object-level comments on PQ itself:
Suppose we have a Python program running on a computer. Truths about the Python program are, in some sense, reducible to physics; however the reduction itself requires resolving philosophical questions. I don’t know if this means the Python program’s functioning (e.g. values of different variables at different times) are “strongly emergent”; it doesn’t seem like an important question to me.
In the case of the Python program this seems clearly false (it’s consistent to view the system as a physical system without reference to the Python program). I expect this is also false in the case of consciousness. I think almost all computationalists would strongly reject downwards causation according to this definition. Do you know of any computationalists who actually advocate downwards causation (i.e. that you can’t predict future physical states by just looking at past physical states without thinking about the higher levels)?
IMO consciousness has power over physics the same way the Python program has power over physics; we can consider counterfactuals like “what if this variable in the Python program magically had a different value” and ask what would happen to physics if this happened (in this case, maybe the variable controls something displayed on a computer screen, so if the variable were changed then the computer screen would emit different light). Actually formalizing questions like “what would happen if this variable had a different value” requires a theory of logical counterfactuals (which MIRI is researching, see this paper).
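As a purely illustrative rendering of the “what if this variable had a different value” idea: one crude way to picture the counterfactual is to re-run the program with the variable patched and compare the downstream physical effect. The variable name and the screen behavior below are invented for the example; what “patching” even means when the program is just physics is exactly what a theory of logical counterfactuals would have to settle.

```python
def screen_output(brightness: int) -> str:
    """Toy stand-in for a program variable that controls what a screen displays (i.e. what light it emits)."""
    return "bright screen" if brightness > 128 else "dim screen"

actual = screen_output(brightness=200)          # what actually happens
counterfactual = screen_output(brightness=50)   # "what if the variable had magically been different?"
print(actual, "->", counterfactual)             # the physical consequence (emitted light) differs
```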
Notably, Python programs usually don’t “make choices” such that “control” is all that meaningful, but humans do. Here I would say that humans implement a decision theory, while most Python programs do not (although some Python programs do implement a decision theory and can be meaningfully said to “make choices”). “Implementing a decision theory” means something like “evaluating multiple actions based on what their effects are expected to be, and choosing the one that scores best according to some metric”; some AI systems like reinforcement learners implement a decision theory.
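And a minimal sketch of “implementing a decision theory” in the sense just described: evaluating several candidate actions by predicting their effects and choosing the best-scoring one. The world model and payoffs are placeholders, not anything from the comment.

```python
def predicted_value(action: str) -> float:
    """Toy world model: assumed payoffs for each candidate action (purely illustrative)."""
    return {"wait": 0.0, "explore": 0.4, "exploit": 0.7}[action]

def choose(actions):
    """Score each action by its predicted effect and take the argmax."""
    return max(actions, key=predicted_value)

print(choose(["wait", "explore", "exploit"]))   # -> "exploit"
```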
(I’m writing this comment to express more “computationalism has a reasonable steelman that isn’t identified as a possible position in PQ” rather than “computationalism is clearly right”)
(more comments)
It seems like this means that empirical tests (e.g. neuroscience stuff) aren’t going to help test aspects of the theory that are about divergence between computational pseudo-qualia (the things people report on) and actual qualia. If I squint a lot I could see “anthropic evidence” being used to distinguish between pseudo-qualia and qualia, but it seems like nothing else would work.
I’m also not sure why we would expect pseudo-qualia to have any correlation with actual qualia? I guess you could make an anthropic argument (we’re viewing the world from the perspective of actual qualia, and our sensations seem to match the pseudo-qualia). That would give someone the suspicion that there’s some causal story for why they would be synchronized, without directly providing such a causal story.
(For the record I think anthropic reasoning is usually confused and should be replaced with decision-theoretic reasoning (e.g. see this discussion), but this seems like a topic for another day)
Yes, the epistemological challenges with distinguishing between ground-truth qualia and qualia reports are worrying. However, I don’t think they’re completely intractable, because there is a causal chain (from Appendix C):
Our brain’s physical microstates (perfectly correlated with qualia) --> The logical states of our brain’s self-model (systematically correlated with our brain’s physical microstates) --> Our reports about our qualia (systematically correlated with our brain’s model of its internal state)
... but there could be substantial blindspots, especially in contexts where there was no adaptive benefit to having accurate systematic correlations.
Awesome, I do like your steelman. More thoughts later, but just wanted to share one notion before sleep:
With regard to computationalism, I think you’ve nailed it. Downward causation seems pretty obviously wrong (and I don’t know of any computationalists that personally endorse it).
Totally agreed, and I like this example.
Right- but I would go even further. Namely, given any non-trivial physical system, there exist multiple equally-valid interpretations of what’s going on at the computational level. The example I give in PQ is: let’s say I shake a bag of popcorn. With the right mapping, we could argue that this physical system simulates the brain of a sleepy cat. With another mapping, however, we could treat the same physical system as simulating the suffering of five holocausts. Very worryingly, we have no principled way to choose between these interpretive mappings. Am I causing suffering by shaking that bag of popcorn?
And I think all computation is like this, if we look closely- there exists no frame-invariant way to map between computation and physical systems in a principled way… just useful mappings, and non-useful mappings (and ‘useful’ is very frame-dependent).
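A toy sketch of the kind of arbitrary mapping being described (my own illustration; the state labels and traces are invented): for any sequence of physical states and any equally long computational trace, one can always write down a lookup table under which the physics “implements” that trace. A table like this ignores counterfactual structure, which is what the counterfactual-based proposals discussed later in the thread try to add back.

```python
# Hypothetical popcorn microstates over time, and two candidate "computations" to read into them.
popcorn_states  = ["p17", "p03", "p42", "p08"]
cat_trace       = ["sleepy", "purr", "stretch", "doze"]
suffering_trace = ["agony1", "agony2", "agony3", "agony4"]

def make_interpretation(physical, computational):
    """Build a mapping that 'reads off' the computation from the physics, timestep by timestep."""
    return dict(zip(physical, computational))

cat_map = make_interpretation(popcorn_states, cat_trace)
bad_map = make_interpretation(popcorn_states, suffering_trace)
print([cat_map[s] for s in popcorn_states])   # the same physics, interpreted as a sleepy cat
print([bad_map[s] for s in popcorn_states])   # ...or interpreted as something horrific
```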
This introduces an inconsistency into computationalism, and has some weird implications: I suspect that, given any computational definition of moral value, there would be a way to prove any arbitrary physical system morally superior to any other arbitrary physical system. I.e., you could prove both that A>B, and B>A.
… I may be getting something wrong here. But it seems like the lack of a clean quarks↔bits mapping ultimately turns out to be a big deal, and is a big reason why I advocate not trying to define moral value in terms of Turing machines & bitstrings.
Instead, I tend to think of ethics as “how should we arrange the [quarks|negentropy] in our light-cone?”—ultimately we live in a world of quarks, so ethics is a question of quarks (or strings, or whatnot).
However! Perhaps this is just a failure of my imagination. What is ethics if not how to arrange our physical world? Or can you help me steelman computationalism against this inconsistency objection?
Thanks again for the comments. They’re both great and helpful.
Thanks for your comments too, I’m finding them helpful for understanding other possible positions on ethics.
OK, how about a rule like this:
(formalizing this rule would require a theory of logical counterfactuals; I’m not sure if I expect a fully general theory to exist but it seems plausible that one does)
I’m not asserting that this rule is correct but it doesn’t seem inconsistent. In particular it doesn’t seem like you could use it to prove A > B and B > A. And clearly your popcorn embeds neither a cat nor the suffering of five holocausts under this rule.
If it turns out that no simple rule of this form works, I wouldn’t be too troubled, though; I’d be psychologically prepared to accept that there isn’t a clean quarks↔computations mapping. Similar to how I already accept that human value is complex, I could accept that human judgments of “does this physical system implement this computation” are complex (and thus can’t be captured in a simple rule). I don’t think this would make me inconsistent, I think it would just make me more tolerant of nebulosity in ethics. At the moment it seems like clean mappings might exist and so it makes sense to search for them, though.
On the object level, it seems like it’s possible to think of painting as “how should we arrange the brush strokes on the canvas?”. But it seems hard to paint well while only thinking at the level of brush strokes (and not thinking about the higher levels, like objects). I expect ethics to be similar; at the very least if human ethics has an “aesthetic” component then it seems like designing a good light cone is at least as hard as making a good painting. Maybe this is a strawman of your position?
On the meta level, I would caution against this use of “ultimately”; see here and here (the articles are worded somewhat disagreeably but I mostly endorse the content). In some sense ethics is about quarks, but in other senses it’s about:
computations
aesthetics
the id, ego, and superego
deciding which side to take in a dispute
a conflict between what we want and what we want to appear to want
nurturing the part of us that cares about others
updateless decision theory
a mathematical fact about what we would want upon reflection
I think these are all useful ways of viewing ethics, and I don’t feel the need to pick a single view (although I often find it appealing to look at what some views say about what other views are saying and resolving the contradictions between them). There are all kinds of reasons why it might be psychologically uncomfortable not to have a simple theory of ethics (e.g. it’s harder to know whether you’re being ethical, it’s harder to criticize others for being unethical, it’s harder for groups to coordinate around more complex and ambiguous ethical theories, you’ll never be able to “solve” ethics once and then never have to think about ethics again, it requires holding multiple contradictory views in your head at once, you won’t always have a satisfying verbal justification for why your actions are ethical). But none of this implies that it’s good (in any of the senses above!) to assume there’s a simple ethical theory.
(For the record I think it’s useful to search for simple ethical theories even if they don’t exist, since you might discover interesting new ways of viewing ethics, even if these views aren’t complete).
I suspect this still runs into the same problem—in the case of the computational-physical mapping, even if we assert that C has changed, we can merely choose a different interpretation of P which is consistent with the change, without actually changing P.
This is an important question: if there exists no clean quarks↔computations mapping, is it (a) a relatively trivial problem, or (b) a really enormous problem? I’d say the answer to this depends on how we talk about computations. I.e., if we say “the ethically-relevant stuff happens at the computational level”—e.g., we shouldn’t compute certain strings—then I think it grows to be a large problem. This grows particularly large if we’re discussing how to optimize the universe! :)
Let me push back a little here- imagine we live in the early 1800s, and Faraday was attempting to formalize electromagnetism. We had plenty of intuitive rules of thumb for how electromagnetism worked, but no consistent, overarching theory. I’m sure lots of people shook their head and said things like, “these things are just God’s will, there’s no pattern to be found.” However, it turns out that there was something unifying to be found, and tolerance of inconsistencies & nebulosity would have been counter-productive.
Today, we have intuitive rules of thumb for how we think consciousness & ethics work, but similarly no consistent, overarching theory. Are consciousness & moral value like electromagnetism—things that we can discover knowledge about? Or are they like elan vital—reifications of clusters of phenomena that don’t always cluster cleanly?
I think the jury’s still out here, but the key with electromagnetism was that Faraday was able to generate novel, falsifiable predictions with his theory. I’m not claiming to be Faraday, but I think if we can generate novel, falsifiable predictions with work on consciousness & valence (I offer some in Section XI, and observations that could be adapted to make falsifiable predictions in Section XII), this should drive updates toward “there’s some undiscovered cache of predictive utility here, similar to what Faraday found with electromagnetism.”
It seems like you’re saying here that there won’t be clean rules for determining logical counterfactuals? I agree this might be the case but it doesn’t seem clear to me. Logical counterfactuals seem pretty confusing and there seems to be a lot of room for better theories about them.
I agree that it would be a large problem. The total amount of effort to “complete” the project of figuring out which computations we care about would be practically infinite, but with a lot of effort we’d get better and better approximations over time, and we would be able to capture a lot of moral value this way.
I mostly agree with your push back; I think when we have different useful views of the same thing that’s a good indication that there’s more intellectual progress to be made in resolving the contradictions between the different views (e.g. by finding a unifying theory).
I think we have a lot more theoretical progress to make on understanding consciousness and ethics. On priors I’d expect the theoretical progress to produce more-satisfying things over time without ever producing a complete answer to ethics. Though of course I could be wrong here; it seems like intuitions vary a lot. It seems more likely to me that we find a simple unifying theory for consciousness than ethics.
Right, and I would argue that logical counterfactuals (in the way we’ve mentioned them in this thread) will necessarily be intractably confusing, because they’re impossible to do cleanly. I say this because in the “P & C” example above, we need a frame-invariant way to interpret a change in C in terms of P. However, we can only have such a frame-invariant way if there exists a clean mapping (injection, surjection, bijection, etc.) between P & C, which I think we can’t have, even theoretically.
(Unless we define both physics and computation through something like constructor theory… at which point we’re not really talking about Turing machines as we know them—we’d be talking about physics by another name.)
This is a big part of the reason why I’m a big advocate of trying to define moral value in physical terms: if we start with physics, then we know our conclusions will ‘compile’ to physics. If instead we start with the notion that ‘some computations have more moral value than others’, we’re stuck with the problem—intractable problem, I argue—that we don’t have a frame-invariant way to precisely identify what computations are happening in any physical system (and likewise, which aren’t happening). I.e., statements about computations will never cleanly compile to physical terms. And whenever we have multiple incompatible interpretations, we necessarily get inconsistencies, and we can prove anything is true (i.e., we can prove any arbitrary physical system is superior to any other).
Does that argument make sense?
… that said, it would seem very valuable to make a survey of possible levels of abstraction at which one could attempt to define moral value, and their positives & negatives.
Totally agreed!
I’m still not sure why you strongly think there’s _no_ principled way; it seems hard to prove a negative. I mentioned that we could make progress on logical counterfactuals; there’s also the approach Chalmers talks about here. (I buy that there’s reason to suspect there’s no principled way if you’re not impressed by any proposal so far).
I don’t think this follows. The universal prior is not objective; you can “prove” that any bit probably follows from a given sequence, by changing your reference machine. But I don’t think this is too problematic. We just accept that some things don’t have a super clean objective answer. The reference machines that make odd predictions (e.g. that 000000000 is probably followed by 1) look weird, although it’s hard to precisely say what’s weird about them without making reference to another reference machine. I don’t think this kind of non-objectivity implies any kind of inconsistency.
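For concreteness, here is the standard formalism behind this point (textbook definitions, nothing specific to this thread): the prior is defined relative to a universal reference machine U, and the invariance theorem only ties two machines together up to a machine-dependent constant.

```latex
% Universal prior with respect to a reference (prefix) machine U:
m_U(x) \;=\; \sum_{p \,:\, U(p) = x} 2^{-|p|}
% Invariance theorem: for any two universal machines U, V there is a constant c_{U,V} with
m_U(x) \;\ge\; 2^{-c_{U,V}}\, m_V(x) \qquad \text{for all } x
```

Next-bit predictions can be taken proportional to m_U(x0) and m_U(x1); since c_{U,V} can be enormous, any two reference machines agree only up to a fixed multiplicative factor, which leaves room to pick a machine whose prediction for a particular finite string (like 000000000) looks arbitrarily weird. That is the non-objectivity being pointed to.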
Similarly, even if objective approaches to computational interpretations fail, we could get a state where computational interpretations are non-objective (e.g. defined relative to a “reference machine”) and the reference machines that make very weird predictions (like the popcorn implementing a cat) would look super weird to humans. This doesn’t seem like a fatal flaw to me, for the same reason it’s not a fatal flaw in the case of the universal prior.
What you’re saying seems very reasonable; I don’t think we differ on any facts, but we do have some divergent intuitions on implications.
I suspect this question—whether it’s possible to offer a computational description of moral value that could cleanly ‘compile’ to physics—would have non-trivial yet also fairly modest implications for most of MIRI’s current work.
I would expect the significance of this question to go up over time, both in terms of direct work MIRI expects to do, and in terms of MIRI’s ability to strategically collaborate with other organizations. I.e., when things shift from “let’s build alignable AGI” to “let’s align the AGI”, it would be very good to have some of this metaphysical fog cleared away so that people could get on the same ethical page, and see that they are in fact on the same page.
Thanks for the response; I’ve found this discussion useful for clarifying and updating my views.
Suppose we live under the wrong moral theory for 100 years. Then we figure out the right moral theory, and live according to that one for the rest of time. How much value is lost in that 100 years? It seems very high but not an x-risk. It seems like we only get x-risks if somehow we don’t put a trusted reflection process (e.g. human moral philosophers) in control of the far future.
It seems quite sensible for people who don’t put overwhelming importance on the far future to care about resolving moral uncertainty earlier. The part of my morality that isn’t exclusively concerned with the far future strongly approves of things like consciousness research that resolve moral uncertainty earlier.
Act-based agents and task AGI kick the problem of global governance to humans. Humans still need to decide questions like how to run governments; they’ll be able to use AGI to help them, but governing well is still a hard problem even with AI assistance. The goal would be that moral errors are temporary; with the right global government structure, moral philosophers will be able to make moral progress and have their moral updates reflected in how things play out.
It’s possible that you think that governing the world well enough that the future eventually reflects human values is very hard even with AGI assistance, and would be made easier with better moral theories made available early on.
I agree with this but place low probability on the antecedent. It’s kind of hard to explain briefly; I’ll point to this comment thread for a good discussion (I mostly agree with Paul).
But now that I think about it more, I don’t put super low probability on the antecedent. It seems like it would be useful to have some way to compare different universes that we’ve failed to put in control of trusted reflection processes, to e.g. get ones that have less horrific suffering or more happiness. I place high probability on “distinguishing between such universes is as hard as solving the AI alignment problem in general”, but I’m not extremely confident of that and I don’t have a super precise argument for it. I guess I wouldn’t personally prioritize such research compared to generic AI safety research but it doesn’t seem totally implausible that resolving moral uncertainty earlier would reduce x-risk for this reason.
I generally agree with this—getting it right eventually is the most important thing; getting it wrong for 100 years could be horrific, but not an x-risk.
I do worry somewhat that “trusted reflection process” is a sufficiently high-level abstraction as to be difficult to critique.
Interesting piece by Christiano, thanks! I would also point to a remark I made above: doing this sort of ethical clarification now (if indeed it’s tractable) will pay dividends in aiding coordination between organizations such as MIRI, DeepMind, etc. Or rather, failing to clarify goals, consciousness, moral value, etc. seems likely to increase the risks of racing to be the first to develop AGI, of secrecy & distrust between organizations, and so on.
A lot does depend on tractability.
I agree that:
clarifying “what should people who gain a huge amount of power through AI do with Earth, existing social structures, and the universe?” seems like a good question to get agreement on for coordination reasons
we should be looking for tractable ways of answering this question
I think:
a) consciousness research will fail to clarify ethics enough to answer enough of (1) to achieve coordination (since I think human preferences on the relevant timescales are way more complicated than consciousness, conditioned on consciousness being simple).
b) it is tractable to answer (1) without reaching agreement on object-level values, by doing something like designing a temporary global government structure that most people agree is pretty good (in that it will allow society to reflect appropriately and determine the next global government structure), but that this question hasn’t been answered well yet and that a better answer would improve coordination. E.g. perhaps society is run as a global federalist democratic-ish structure with centralized control of potentially destructive technology (taking into account “how voters would judge something if they thought longer” rather than “how voters actually judge something”; this might be possible if the AI alignment problem is solved). It seems quite possible to create proposals of this form and critique them.
It seems like we disagree about (a) and this disagreement has been partially hashed out elsewhere, and that it’s not clear we have a strong disagreement about (b).
What about moral uncertainty as an alternative to CEV? (https://nebula.wsimg.com/1cc278bf0e7470c060032c9624508149?AccessKeyId=07941C4BD630A320288F&disposition=0&alloworigin=1)
I expect:
We would lose a great deal of value by optimizing the universe according to current moral uncertainty, without the opportunity to reflect and become less uncertain over time.
There’s a great deal of reflection necessary to figure out what actions moral theory X recommends, e.g. to figure out which minds exist or what implicit promises people have made to each other. I don’t see this reflection as distinct from reflection about moral uncertainty; if we’re going to defer to a reflection process anyway for making decisions, we might as well let that reflection process decide on issues of moral theory.
What if an AI exploring moral uncertainty finds that there is provably no correct moral theory or no right moral facts? In that case, there is no moral uncertainty between moral theories, as they are all false. Could it escape this obstacle just by aggregating humans’ opinions about possible situations?
In that case it would be exploring traditional metaethics, not moral uncertainty.
But if moral uncertainty is used as a solution then we just bake in some high level criteria for the appropriateness of a moral theory, and the credences will necessarily sum to 1. This is little different from baking in coherent extrapolated volition. In either case the agent is directly motivated to do whatever it is that satisfies our designated criteria, and it will still want to do it regardless of what it thinks about moral realism.
Those criteria might be very vague and philosophical, or they might be very specific and physical (like ‘would a simulation of Bertrand Russell say “a-ha, that’s a good theory”?’), but either way they will be specified.
So you’d put the probability of CEV working at between 90 and 99 percent? 90% seems plausible to me if a little high; 99% seems way too high.
But I have to give you a lot of credit for saying “the possibility of CEV discounts how valuable this is” instead of “this doesn’t matter because CEV will solve it”; many people say the latter, implicitly assuming that CEV has a near-100% probability of working.
No, rather lower than that (80%?). But I think that we’re more likely to attain only somewhat-flawed versions of the future without something CEV-ish. This reduces my estimate of the value of getting them kind of right, relative to getting good outcomes through worlds which do achieve something like CEV. I think that probably ex-post provides another very large discount factor, and the significant chance that it does provides another modest ex-ante discount factor (maybe another 80%; none of my numbers here are deeply considered).
Hey, I (David Krueger) remember we spoke about this a bit with Toby when I was at FHI this summer.
I think we should be aiming for something like CEV, but we might not get it, and we should definitely consider scenarios where we have to settle for less.
For instance, some value-aligned group might find that its best option (due to competitive pressures) is to create an AI which has a 50% probability of being CEV-like or “aligned via corrigibility”, but has a 50% probability of (effectively) prematurely settling on a utility function whose goodness depends heavily on the nature of qualia.
If (as I believe) such a scenario is likely, then the problem is time-sensitive.
This feels extremely unlikely; I don’t think we have plausible paths to obtaining a non-negligibly good outcome without retaining the ability to effectively deliberate about e.g. the nature of qualia. I also suspect that we will be able to solve the control problem, and if we can’t then it will be because of failure modes that can’t be avoided by settling on a utility function. Of course “can’t see any way it can happen” is not the same as “am justifiably confident it won’t happen,” but I think in this case it’s enough to get us to pretty extreme odds.
More precisely, I’d give 100:1 against: (a) we will fail to solve the control problem in a satisfying way, (b) we will fall back to a solution which depends on our current understanding of qualia, (c) the resulting outcome will be non-negligibly good according to our view about qualia at the time that we build AI, and (d) it will be good because we hold that view about qualia.
(My real beliefs might be higher than 1% just based on “I haven’t thought about it very long” and peer disagreement. But I think it’s more likely than not that I would accept a bet at 100:1 odds after deliberation, even given that reasonable people disagree.)
(By non-negligibly good I mean that we would be willing to make some material sacrifice to improve its probability compared to a barren universe, perhaps of $1000/1% increase. By because I mean that the outcome would have been non-negligibly worse according to that view if we had not held it.)
I’m not sure if there is any way to turn the disagreement into a bet. Perhaps picking an arbiter and looking at their views in a decade? (e.g. Toby, Carl Shulman, Wei Dai?) This would obviously involve less extreme odds.
Probably more interesting than betting is resolving the disagreement. This seems to be a slightly persistent disagreement between me and Toby, I have never managed to really understand his position but we haven’t talked about it much. I’m curious about what kind of solutions you see as plausible—it sounds like your view is based on a more detailed picture rather than an “anything might happen” view.
I think I was too terse; let me explain my model a bit more.
I think there’s a decent chance (OTTMH, let’s say 10%) that without any deliberate effort we make an AI which wipes out humanity, but is anyhow more ethically valuable than us (although not more than something which we deliberately design to be ethically valuable). This would happen, e.g., if it were simply the default outcome (e.g. if it turns out that intelligence ~ ethical value). This may actually be the most likely path to victory.*
There’s also some chance that all we need to do to ensure that AI has (some) ethical value (e.g. due to having qualia) is X. In that case, we might increase our chance of doing X by understanding qualia a bit better.
Finally, my point was that I can easily imagine a scenario in which our alternatives are:
1. Build an AI with 50% chance of being aligned, 50% chance of just being an AI (with P(AI has property X) = 90% if we understand qualia better, 10% otherwise).
2. Allow our competitors to build an AI with ~0% chance of being ethically valuable.
So then we obviously prefer option 1, and if we understand qualia better, option 1 becomes better.
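A toy expected-value calculation makes the comparison explicit; the particular values V_ALIGNED and V_X below are placeholders chosen for illustration, not numbers from the thread.

```python
# Assumed placeholder values: worth of an aligned AI, and of an unaligned AI that
# nevertheless has property X (e.g. ethically valuable qualia).
V_ALIGNED, V_X = 1.0, 0.3

def option_1(p_x: float) -> float:
    """50% aligned, 50% unaligned with probability p_x of having property X."""
    return 0.5 * V_ALIGNED + 0.5 * p_x * V_X

print(option_1(0.9))   # if we understand qualia better  -> 0.635
print(option_1(0.1))   # if we don't                     -> 0.515
print(0.0)             # option 2: competitor AI with ~0% chance of ethical value
```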
* I notice as I type this that this may have some strange consequences RE high-level strategy; e.g. maybe it’s better to just make something intelligent ASAP and hope that it has ethical value, because this reduces X-risk, and we might not be able to do much to change the distribution of ethical value produced by the AI we create anyhow. I tend to think that we should aim to be very confident that the AI we build is going to have lots of ethical value, but this may only make sense if we have a pretty good chance of succeeding.
Ah, that makes a lot more sense, sorry for misinterpreting you. (I think Toby has a view closer to the one I was responding to, though I suspect I am also oversimplifying his view.)
I agree that there are important philosophical questions that bear on the goodness of building various kinds of (unaligned) AI, and I think that those questions do have impact on what we ought to do. The biggest prize is if it turns out that some kinds of unaligned AI are much better than others, which I think is plausible. I guess we probably have similar views on these issues, modulo me being more optimistic about the prospects for aligned AI.
I don’t think that an understanding of qualia is an important input into this issue though.
For example, from a long-run ethical perspective, whether or not humans have qualia is not especially important, and what mostly matters is human preferences (since those are what shape the future). If you created a race of p-zombies that nevertheless shared our preferences about qualia, I think it would be fine. And “the character of human preferences” is a very different kind of object than qualia. These questions are related in various ways (e.g. our beliefs about qualia are related to our qualia and to philosophical arguments about consciousness), but after thinking about that a little bit I think it is unlikely that the interaction is very important.
To summarize, I do agree that there are time-sensitive ethical questions about the moral value of creating unaligned AI. This was item 1.2 in this list from 4 years ago. I could imagine concluding that the nature of qualia is an important input into this question, but don’t currently believe that.
For me at least, immediate cause prioritization questions partially depend upon answers to the question of consciousness.