Thank you for the very interesting post! I agree with most of what you’re saying here.
So what is your hypothesis as to why psychopaths don’t currently control and dominate society (or do you believe they actually do)?
Is it because:
“you can manipulate a psychopath by appealing to their desires” which gives you a way to beat them?
they eventually die (before they can amass enough power to take over the world)?
they ultimately don’t work well together because they’re just looking out for themselves, so have no strength in numbers?
they take over whole countries, but there are other countries banded together to defend against them (non-psychopaths hold psychopaths at bay through strength in numbers)?
something else?
Of course, the fact that the psychopaths among us haven’t (yet) won the ultimate battle for control doesn’t mean psychopathic AGI won’t in the future.
I take the following message from your presentation of the material: “we’re screwed, and there’s no hope.” Was that your intent?
I prefer the following message: “the chances of success with guardian AGI’s may be small, or even extremely small, but such AGI’s may also be the only real chance we’ve got, so let’s go at developing them with full force.” Maybe we should have a Manhattan project on developing “moral” AGI’s?
Here are some arguments that tend toward a slightly more optimistic take than you gave:
Yes, guardian AGI’s will have the disadvantage of constraints compared to “psychopathic” AGI, but if there are enough guardians, perhaps they can (mostly) keep the psychopathic AGI’s at bay through strength in numbers (how exactly the defense-offense balance works out may be key for this, especially because psychopathic AGI’s could form (temporary) alliances as well)
Although it may seem very difficult to figure out how to make moral AGI’s, as AI’s get better, they should increase our chances of being able to figure this out with their help—particularly if people focus specifically on developing AI systems for this purpose (such as through a moral AGI Manhattan project)
Thanks for sharing this interesting Draft Amnesty post. I’ve been thinking a lot about these sorts of things, and want to make a couple of points that may or may not relate to your current beliefs/understandings (I think they’ll relate to someone’s):
Any theory of consequentialism that doesn’t take into account the effects of our actions/inactions on our consciences and thus our well-beings is an incomplete theory of consequentialism (it doesn’t include all consequences). By considering conscience effects, a difference between killing and letting die becomes apparent.
I personally like the “limited number of bets in our lifetimes” argument against following a decision theory fanatically dependent on expected value calculations, i.e., even when probabilities are super low. Basically, if I could make on the order of 10^20 bets in my lifetime, it might make sense to take a bet at a chance of 1 in 10^20 because eventually I’d end up winning, but since I’ll never live long enough to make 10^20 such bets, I shouldn’t take this one bet.
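The “limited number of bets” point can be made concrete with a toy calculation (the per-bet probability and lifetime bet count below are made-up illustrative numbers, not claims from the post):

```python
# Toy illustration of the "limited number of bets" argument.
# Both numbers here are hypothetical.
p = 1e-20            # chance of winning a single long-shot bet
lifetime_bets = 1e5  # a generous guess at how many such bets fit in one lifetime

# The chance of winning at least once is 1 - (1 - p)**n, which for tiny p
# is approximately n * p (the exact expression underflows in float arithmetic).
chance_of_ever_winning = lifetime_bets * p
print(chance_of_ever_winning)  # ~1e-15: effectively never in one lifetime
```

With ~10^20 bets available, n·p would approach 1 and taking the bet could pay off eventually; with only ~10^5, the chance of ever winning is negligible, which is the asymmetry the argument turns on.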
I think there are two concepts one could consider for responsibility for damages: one is for who’s responsible to pay for the damages, and one is for whether someone feels responsible in their conscience for the damages. The first would be affected by how many other people are involved such as if 3 of us pushed a car off a cliff, I might be responsible to pay for 1/3rd of the damages, or even up to the full damages if the other 2 people didn’t have the ability to pay. The second would be affected by if I thought I significantly directly contributed to at least some fraction of the damages, no matter how many other people were involved. Under this second concept of responsibility, I may choose not to eat meat because if I did, I’d feel that I was significantly contributing to the pain and killing of some amount of animals on factory farms.
Thanks for the post, you bring up some interesting points. I think one of the key things that’s missing from Singer’s approach is just how important personal responsibility is to well-being. Unfortunately, I don’t have my alternative framework all figured out yet, but here’s a start towards it. One example is that we have most responsibility for our own children since we brought them into existence and they generally can’t fend for themselves, so, under many circumstances, giving them priority is the most overall well-being-promoting thing to do.
I’m glad to see you are questioning some of the philosophy behind EA, and I hope that more people will do so. I believe a shift to protecting rights (e.g., fighting corruption) and promoting responsibility (of which mental health is a big subset since it involves taking responsibility for your emotions) could potentially help make EA as a movement much more effective.
Thank you for this interesting post, even though I don’t agree with your conclusions.
I believe one key difference between killing someone and letting someone die lies in the effect each has on one’s conscience.
If I kill someone, I violate their rights. Even if no one would directly know what I did with the invisible button, I’d know what I did, and that would eat at my conscience, and affect how I’d interact with everyone after that. Suddenly, I’d have less trust in myself to do the right thing (to not do what my conscience strongly tells me not to do), and the world would seem like a less safe place because I’d suspect that others would’ve made the same decision I did, and now might be effectively willing to kill me for a mere $6,000 if they could get away with it.
If I let someone die, I don’t violate their rights, and, especially if I don’t directly experience them dying, there’s just less of a pull on my conscience.
One could argue that our consciences don’t make sense and should be more in line with classic utilitarianism, but I’d argue that we should be extremely careful about making big changes to human consciences in general without thoroughly thinking through and understanding the full range of effects such changes would have.
Also, I don’t think use of the term “moral obligation” is optimal, since to me it implies a form of emotional bullying/blackmail: you’re not a good person unless you satisfy your moral obligations. Instead, I’d focus on people being true to their own consciences. In my mind, it’s a question of trying to use someone’s self-hate to “beat goodness into them” versus trying to inspire their inner goodness to guide them because that’s what’s ultimately best for them.
By “self-hate,” I mean hate of the parts of ourselves that we think are “bad person” parts, but are really just “human nature” parts that we can accept about ourselves without that meaning we have to indulge them.
Have you tried cooking your best vegan recipes for others? In my experience sometimes people ask for the recipe and make it for themselves later, especially health-conscious people. For instance, I really like this vegan pumpkin pie that’s super easy to make: https://itdoesnttastelikechicken.com/easy-vegan-pumpkin-pie/
I admit I get a bit lost reading your comments as to what exactly you want me to respond to, so I’m going to try to write it out in a numbered list. Please correct or add to this list as you see fit and send it back to me; that way I can answer your actual points rather than what I think they are, in case I have them wrong:
Explain how you think an AGI system that has sufficient capabilities to follow your “conscience calculator” methodology wouldn’t have sufficient capabilities to follow a simple single sentence command from a super-user human of good intent, such as, “Always do what a wise version of me would want you to do.”
Justify that going through the exercise of manually writing out conscience breaches and assigning formulas for calculating their weights could speed up a future AGI in figuring out an optimal ethical decision making system for itself. (I’m taking it as a given that most people would agree it’d be good, i.e., generally yield better results in the world, for an AGI to have a consistent ethical decision making system onboard.)
#1 was what I was trying to get at with my last reply about how you could use a “weak AI” (something that’s less capable than an agentic AGI) to do the “conscience calculator” methodology and then just output a go/no go response to an inner aligned AGI as to what decision options it was allowed to take or not. The AGI would come up with the decision options based on some goal(s) it has, such as doing what a user asks of it, e.g., “make me lots of money!” The AGI would “brainstorm” possible paths to make lots of money and the “weak AI” would come back with a go/no go on a certain path because, for instance, it doesn’t involve or does involve stealing. Here I’ve been trying to illustrate that an AI system that had sufficient capabilities to follow my “conscience calculator” methodology wouldn’t need to have sufficient capabilities to follow a broad super-user command such as “Always do what a wise version of me would want you to do.”
Of course, to be useful, the AGI needs to be able to follow a non-super-user’s, i.e., a user’s, commands reasonably well, such as figuring out what the user means by “make me lots of money!” The crux, I think, is that I see “make me lots of money” as a significantly simpler concept than “always do what the wise me would want.” And basically what I’m trying to do with my conscience calculator is provide a framework that makes it possible for an AGI of limited abilities to calculate, straight off the bat, what “wise me” would want with sufficiently high accuracy that I wouldn’t be too worried about really bad outcomes. Do I have a lot of work to do to get to this goal? Yes. I have to define the conscience breaches more precisely (something I mentioned in my post and that you made reference to in your comment), assign “wise me” formulas for conscience weights, and then test the system on actual AI’s as they get closer and closer to AGI, to make sure it consistently works and any bugs can be ironed out before it’d be used as actual guard rails for a real-world AGI agent.
Regarding #2, it sounds again like you’re expecting early AGI’s to be more capable than I do:
What is latent in human text
When I personally try to figure new things out, such as a consistent system of ethics an AGI could use, I’ll come up with some initial ideas, then read some literature, then update my ideas, which then might point me to new literature I should read, so I’ll read that, and keep going back and forth between my own ideas and the literature when I get stuck with my own ideas. This seems like a much more efficient process for me than simply trying to figure out everything myself based on what I know right now, or of trying to read all possible related literature and then decide what I think from there.
An AGI, though, should be able to read all possible literature very quickly. It seems likely that it would do this to be able to most quickly come up with a list of hypotheses (its own ideas) to test. The further anything is from the “right” answer in the literature, and the lesser the variety of “wrong” ideas explored there, the more the AGI will have to work to come up with the “right” answer itself.[1] So at the very least, I hope to contribute to the variety of “wrong” ideas in the literature, but of course I’m aiming for something closer to the “right” answer than what’s currently out there.
I’m of the opinion there’s a good chance (and I’d take anything higher than, say, 1 in 10,000 as a “good” chance when we’re talking about potentially horrible outcomes) someone “bad” will let loose a not-so-well-aligned AGI before we have super-well-aligned (both inner and outer aligned) AGI’s ready to autonomously defend against them.[2] Since my expertise is more well-suited for outer alignment than anything else in the alignment space, if I can make a tiny contribution towards speeding up outer alignment and making good AGI’s more likely to win these initial battles, great.
For a conscience calculator to work as a guard rail system for an AGI, we’ll need an AGI or weak AI to translate reality into numerical parameters: first identifying which conscience breaches apply in a certain situation, drawing from the list in Appendix A, and then estimating the parameters that will go into the “conscience weight” formulas (to be provided in a future post)[1] to calculate the total conscience weight for a given decision option. The system should choose the decision option(s) with the minimum conscience weight. So I’m not saying, “Hey, AGI, don’t make any of the conscience breaches I list in Appendix A, or at least minimize them.” I’m saying, “Hey, human person, bring me that weak AI that doesn’t even really understand what I’m talking about, and let’s have it translate reality into the parameters it’ll need for calculating, using Appendix A and the formulas I’ll provide, what the conscience weights are for each decision option. Then it can output to the AGI (or just be a module in the AGI) which decision option or options have the minimum, or ideally zero, total conscience breach weight. And hopefully those people who’ve been worrying about how to align AGI’s will be able to make the decision option(s) with the minimum conscience breach weight binding on the AGI so it can’t choose anything else.”
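The pipeline just described can be sketched in a few lines of code. To be clear, the breach names and weights below are hypothetical placeholders, not the actual Appendix A list or the promised weight formulas; the sketch only shows the minimum-total-weight selection step:

```python
# Minimal sketch of the "conscience calculator" guard-rail idea.
# Breach names and weights are hypothetical stand-ins for Appendix A
# and the future weight formulas.
BREACH_WEIGHTS = {
    "stealing": 100.0,
    "lying": 40.0,
    "breaking_promise": 25.0,
}

def total_conscience_weight(breaches):
    """Sum the weights of the breaches a decision option would involve."""
    return sum(BREACH_WEIGHTS[b] for b in breaches)

def permitted_options(options):
    """Return the option(s) with minimal total conscience weight.

    `options` maps an option name to the breaches the weak AI judged
    that option to involve; the minimum-weight option(s) would then be
    made binding on the AGI.
    """
    weights = {name: total_conscience_weight(b) for name, b in options.items()}
    best = min(weights.values())
    return [name for name, w in weights.items() if w == best]

# Example: three candidate paths the AGI "brainstormed" for "make me lots of money!"
options = {
    "insider_trading": ["stealing", "lying"],
    "aggressive_marketing": ["lying"],
    "index_fund": [],
}
print(permitted_options(options))  # -> ["index_fund"], the zero-breach option
```

The hard part, of course, is everything the sketch assumes away: the weak AI’s translation of reality into breach labels and formula parameters, and the binding mechanism itself.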
Basically, I’m trying to come up with a system to align an AGI to once people figure out how to rigorously align an AGI to anything. It seems to me that people under-estimate how important exactly what to align to will end up being, and/or how difficult it’s going to be to come up with the specifications on what to align to so they generalize well to all possible situations.
Regarding your paragraph 3 about the difficulty of AI understanding our true values:
and that there’s some large probability it implies preventing (human and nonhuman) tragedies in the meantime…
Personally, I’m not comfortable with “large” probabilities of preventing tragedies—people could say that’s the case for “bottom up” ML ethics systems if they manage to achieve >90% accuracy and I’d say, “Oh, man, we’re in trouble if people let an AGI loose thinking that’s good enough.” But this is just a gut feel, really—maybe the first AGI’s will have enough “common sense” to generalize well and not do the big unethical bad stuff. I’d rather not bank on that, though. My work for AI’s is geared first and foremost towards reducing risks from the first alignable agentic AGI’s to be let out in the world.
Btw, I think there are a couple of big holes in the ethics literature, which is why I think my work could help speed up an AGI figuring out ethics for itself:
There’ve been very few attempts to quantify ethics and make it calculable
There’s an under-appreciation of, or at least an under-emphasis on, the importance of personal responsibility for long-term human well-being
I hope this clears some things up—if not, let me know, thanks!
(Also, this quote looks like a rationalization/sunk-cost-fallacy to me; as I’m not you, I can’t say whether it is for sure. But if I seemed (to someone) to do this, I would want that someone to tell me, so I’m telling you.)
I do appreciate you calling it like you see it, thank you! I don’t think I’m making a rationalization/sunk-cost-fallacy here, but I could be wrong—I seem to see things much differently than the average EA Forum/LessWrong reader as evidenced by the lack of upvotes for my work on trying to figure out how to quantify ethics and conscience for AI’s.
I think perhaps our main point of disagreement is how easy we think it’ll be for an AGI to (a) understand the world well enough to function at a human level over many domains, and (b) understand from our words and actions what we humans really want (what we deeply value rather than just surface value). I think the latter will be much more difficult.
Maybe my model for how an AGI would go about figuring out human values and ethics and conscience is flawed, but it seems like it would be efficient for an AGI to read the literature and then form its own best hypotheses and test them. So here I’m trying to contribute to the literature to speed up its process (that’s not my only motivation for my posts, but it’s one).
Ah, I see, thank you for the clarification. I’m not sure how the trajectory of AGI’s will go, but my worry is that we’ll have some kind of a race dynamic wherein the first AGI’s will quickly have to go on the defensive against bad actors’ AGI’s, and neither will really be at the level you’re talking about in terms of being able to extract a coherent set of human values (which I think would require ASI, since no human has been successful at doing this, as far as I know, but everyday humans can tell what a lie is and what stealing is). If I can create a system that everyday humans can follow, then “everyday” AGI’s should be able to follow it, too, at least to some degree of accuracy. That may be enough to avoid significant collateral damage in a “fight” between some of the first AGI’s to come online. But time will tell… Thanks again for the thought-provoking comment.
If I understand you correctly, you’re saying that any AGI that could apply the system I’m coming up with could just come up with an idealized system better itself, is that right? I don’t know if that’s true (since I don’t know what the first “AGI’s” will really look like), but even if my work only speeds up an AGI’s ability to do this itself by a small amount, that might still make a big difference in how things turn out in the world, I think.
Thanks for the post. There are some writings out of the Center for Reducing Suffering that may interest you. They tend to take a negative utilitarian view of things, which has some interesting implications, in particular for the repugnant conclusion(s).
I’ve been trying to come up with my own version of utilitarianism that I believe takes better account of the effects of rights and self-esteem/personal responsibility. In doing so, it’s become more and more apparent to me that our consciences are not naturally classic utilitarian in nature, and this is likely where some apparent disagreements between utilitarian implications and our moral intuitions (as from our consciences) arise. I’m planning on writing something up soon on how we might go about quantifying our consciences so that they could be used in a quantitative decision-making process (as by an AI), rather than trying to make a full utilitarian framework into a decision-making framework for an AI. This has some similarities to what is often suggested by Richard Chappell, i.e., that we follow heuristics (in this case, our consciences) when making decisions rather than some “utilitarian calculus.”
Thanks for the post. Just today I was thinking through some aspects of expected value theory and fanaticism (i.e., being fanatical about applying expected value theory) that I think might apply to your post. I had read through some of Hayden Wilkinson’s Global Priorities Institute report from 2021, “In defense of fanaticism,” in which he brings up a hypothetical case of donating $2,000 (or whatever it takes to statistically save one life) to the Against Malaria Foundation (AMF), versus giving the money instead to a very speculative research project with a tiny, non-zero chance of producing an amazingly valuable future. I changed the situation for myself to consider why one would give $2,000 to AMF instead of donating it to try to reduce existential risk by some tiny amount, when the latter could have significantly higher expected value. I’ve come up with two possible reasons so far not to give your entire $2,000 to reducing existential risk, even if you initially intellectually estimate it to have much higher expected value:
As a hedge—how certain are you of how much difference $2000 would make to reducing existential risk? If 8 billion people were going to die and your best guess is that $2000 could reduce the probability of this by, say, 1E-7%/year, the expected value of this in a year would be 8 lives saved, which is more than the 1 life saved by AMF (for simplicity, I’m assuming that 1 life would be saved from malaria for certain, and only considering a timeframe of 1 year). (Also, for ease of discussion, I’m going to ignore all the value lost in future lives un-lived if humans go extinct.) So now you might say your $2000 is estimated to be 8 times more effective if it goes to existential risk reduction than malaria reduction. But how sure are you of the 1E-7%/year number? If the “real” number is 1E-8%/year, now you’re only saving 0.8 life in expectation. The point is, if you assigned some probability distribution to your estimate of existential risk reduction (or even increase), you’d find that some finite percentage of cases in this distribution would favor malaria reduction over existential risk reduction. So the intellectual math of fanatical expected value maximizing, when considered more fully, still supports sending some fraction of money to malaria reduction rather than sending it all to existential risk reduction. (Of course, there’s also the uncertainty of applying expected value theory fanatically, so you could hedge that as well if a different methodology gave different prioritization answers.)
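The back-of-the-envelope numbers in the hedging argument above can be checked directly (the 1E-7%/year figure, i.e., 1e-9/year, is the illustrative guess from the paragraph, not an estimate of the actual risk reduction):

```python
# Redoing the hedging arithmetic from the paragraph above.
# The annual risk-reduction figures are the post's illustrative guesses.
population = 8e9       # people who would die in the extinction scenario
amf_lives_saved = 1.0  # assume $2000 to AMF saves 1 life; 1-year horizon

best_guess = 1e-9    # guessed annual extinction-risk reduction per $2000 (1e-7 %/year)
lower_guess = 1e-10  # same guess, an order of magnitude more pessimistic (1e-8 %/year)

expected_lives_best = population * best_guess    # ~8.0: beats AMF's 1 life
expected_lives_lower = population * lower_guess  # ~0.8: loses to AMF's 1 life
print(expected_lives_best, expected_lives_lower)
```

The point carries through: within a plausible probability distribution over the risk-reduction estimate, some slice of cases favors malaria reduction, which is what justifies splitting the donation even on fanatical expected-value grounds.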
To appear more reasonable to people who mostly follow their gut—“What?! You gave your entire $2000 to some pie in the sky project on supposedly reducing existential risk that might not even be real when you could’ve saved a real person’s life from malaria?!” If you give some fraction of your money to a cause other people are more likely to believe is, in their gut, valuable, such as malaria reduction, you may have more ability to persuade them into seeing existential risk reduction as a reasonable cause for them to donate to as well. Note: I don’t know how much this would reap in terms of dividends for existential risk reduction, but I wouldn’t rule it out.
I don’t know if this is exactly what you were looking for, but these seem to me to be some things to think about to perhaps move your intellectual reasoning closer to your gut, meaning you could be intellectually justified in putting some of your effort into following your gut (how much exactly is open to argument, of course).
In regards to how to make working on existential risk more “gut wrenching,” I tend to think of things in terms of responsibility. If I think I have some ability to help save humanity from extinction or near extinction, and I don’t act on that, and then the world does end, imagining that situation makes me feel like I really dropped the ball on my part of responsibility for the world ending. If I don’t help people avoid dying from malaria, I do still feel a responsibility that I haven’t fully taken up, but that doesn’t hit me as hard as the chance of the world ending, especially if I think I have special skills that might help prevent it. By the way, if I felt like I could make the most difference personally, with my particular skill set and passions, in helping reduce malaria deaths, and other people were much more qualified in the area of existential risk, I’d probably feel more responsibility to apply my talents where I thought they could have the most impact, in that case malaria death reduction.
Thanks for the comment and the link to the review paper!
I think most people, including researchers, don’t have a good handle on what self-esteem is, or at least what truly raises or lowers it—I would expect the effect of praise to be weak, but the effect of promoting responsibility for one’s emotions and actions to be strong. The closest to my views on self-esteem that I’ve found so far are those in N. Branden’s “Six Pillars of Self-Esteem”—the six pillars are living consciously, self-acceptance, self-responsibility, self-assertiveness, living purposefully, and personal integrity.
Unfortunately, because many researchers don’t follow this conception of self-esteem, I tend not to trust much research on the real-world effects of self-esteem. Honestly, though, I haven’t done a hard search for any research that uses something close to my conception of self-esteem, and your comment has basically pointed out that I should get on that, so thank you!
There doesn’t appear to be a link for:
Do you know about The Dignity Index? Might be interesting to team up with them/get their input.
Interesting idea, thanks for putting it out there. I’m currently trying to figure out better answers to some of the things you mentioned (at least “better” in terms of more in-line with my own intuitions). For example, I’ve been working on incorporating apparently non-consequentialist considerations into a utilitarian framework:
https://forum.effectivealtruism.org/posts/S5zJr5zCXc2rzwsdo/a-utilitarian-framework-with-an-emphasis-on-self-esteem-and
https://forum.effectivealtruism.org/posts/fkrEbvw9RWir5ktoP/creating-a-conscience-calculator-to-guard-rail-an-agi
I’m currently doing this work unpaid and independently. I don’t have a Patreon page for individuals to support it directly, in part because the lack of upvotes on my work has indicated little interest. If you’d like to support my work, though, please consider buying my ebook on honorable speech:
Honorable Speech: What Is It, Why Should We Care, and Is It Anywhere to Be Found in U.S. Politics?
Thanks!
I admit I get a bit lost reading your comments as to what exactly you want me to respond to, so I’m going to try to write it out as a numbered list. Please correct/add to this list as you see fit and send it back to me; that way, if I have your points wrong, I can answer what you actually meant rather than what I think you meant:
1. Explain how you think an AGI system that has sufficient capabilities to follow your “conscience calculator” methodology wouldn’t have sufficient capabilities to follow a simple single-sentence command from a super-user human of good intent, such as, “Always do what a wise version of me would want you to do.”
2. Justify that going through the exercise of manually writing out conscience breaches and assigning formulas for calculating their weights could speed up a future AGI in figuring out an optimal ethical decision-making system for itself. (I’m taking it as a given that most people would agree it’d be good, i.e., generally yield better results in the world, for an AGI to have a consistent ethical decision-making system onboard.)
#1 was what I was trying to get at with my last reply about how you could use a “weak AI” (something that’s less capable than an agentic AGI) to do the “conscience calculator” methodology and then just output a go/no go response to an inner aligned AGI as to what decision options it was allowed to take or not. The AGI would come up with the decision options based on some goal(s) it has, such as doing what a user asks of it, e.g., “make me lots of money!” The AGI would “brainstorm” possible paths to make lots of money and the “weak AI” would come back with a go/no go on a certain path because, for instance, it doesn’t involve or does involve stealing. Here I’ve been trying to illustrate that an AI system that had sufficient capabilities to follow my “conscience calculator” methodology wouldn’t need to have sufficient capabilities to follow a broad super-user command such as “Always do what a wise version of me would want you to do.”
Of course, to be useful, the AGI needs to be able to follow a non-super-user’s, i.e., a user’s, commands reasonably well, such as figuring out what the user means by “make me lots of money!” The crux, I think, is that I see “make me lots of money” as a significantly simpler concept than “always do what the wise me would want.” Basically, what I’m trying to do with my conscience calculator is provide a framework that makes it possible for an AGI of limited abilities to calculate, straight off the bat, what “wise me” would want with high enough accuracy that I wouldn’t be too worried about really bad outcomes. Do I have a lot of work to do to get to this goal? Yes. I have to define the conscience breaches more precisely (something I mentioned in my post and that you made reference to in your comment), assign “wise me” formulas for conscience weights, and then test the system on actual AI’s as they get closer and closer to AGI, to make sure it works consistently and any bugs can be ironed out before it’s used as actual guard rails for a real-world AGI agent.
Regarding #2, it sounds again like you’re expecting early AGI’s to be more capable than I do:
When I personally try to figure new things out, such as a consistent system of ethics an AGI could use, I’ll come up with some initial ideas, then read some literature, then update my ideas, which then might point me to new literature I should read, so I’ll read that, and keep going back and forth between my own ideas and the literature when I get stuck with my own ideas. This seems like a much more efficient process for me than simply trying to figure out everything myself based on what I know right now, or of trying to read all possible related literature and then decide what I think from there.
An AGI, though, should be able to read all possible literature very quickly. It seems likely that it would do this to be able to most quickly come up with a list of hypotheses (its own ideas) to test. The further anything in the literature is from the “right” answer, and the smaller the variety of “wrong” ideas explored there, the more the AGI will have to work to come up with the “right” answer itself.[1] So at the very least, I hope to contribute to the variety of “wrong” ideas in the literature, but of course I’m aiming for something closer to the “right” answer than what’s currently out there.
I’m of the opinion there’s a good chance (and I’d take anything higher than, say, 1 in 10,000 as a “good” chance when we’re talking about potentially horrible outcomes) someone “bad” will let loose a not-so-well-aligned AGI before we have super-well-aligned (both inner and outer aligned) AGI’s ready to autonomously defend against them.[2] Since my expertise is more well-suited for outer alignment than anything else in the alignment space, if I can make a tiny contribution towards speeding up outer alignment and making good AGI’s more likely to win these initial battles, great.
[1] Let’s say, for the sake of argument, that there is a “right” answer.
[2] It’ll have to be autonomous at least over most decisions because humans won’t be able to keep up in real time with AGI’s fighting it out.
I’ll try to clarify my vision:
For a conscience calculator to work as a guard rail system for an AGI, we’ll need an AGI or weak AI to translate reality into numerical parameters: first identifying which conscience breaches apply in a certain situation, drawing from the list in Appendix A, and then estimating the parameters that will go into the “conscience weight” formulas (to be provided in a future post)[1] to calculate the total conscience weight for a given decision option. The system should choose the decision option(s) with the minimum conscience weight. So I’m not saying, “Hey, AGI, don’t make any of the conscience breaches I list in Appendix A, or at least minimize them.” I’m saying, “Hey, human person, bring me that weak AI that doesn’t even really understand what I’m talking about, and let’s have it translate reality into the parameters it’ll need for calculating, using Appendix A and the formulas I’ll provide, what the conscience weights are for each decision option. Then it can output to the AGI (or just be a module in the AGI) which decision option or options have the minimum, or ideally zero, total conscience breach weight. And hopefully those people who’ve been worrying about how to align AGI’s will be able to make the decision option(s) with the minimum conscience breach weight binding on the AGI so it can’t choose anything else.”
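To make that flow concrete, here’s a minimal sketch of how such a guard-rail module might score decision options and pick the allowed one(s). The breach types, weight formulas, and parameters below are made-up placeholders of my own, not the actual Appendix A list or the forthcoming conscience-weight formulas:

```python
# Hypothetical sketch of the conscience-calculator guard rail described above.
# Breach types, weight formulas, and parameters are illustrative placeholders.

def conscience_weight(breaches):
    """Sum illustrative weights for the breaches detected in one decision option."""
    # Placeholder formulas: e.g., weight a killing by remaining life expectancy,
    # a theft by amount stolen, a lie by severity.
    WEIGHTS = {
        "kill": lambda p: 1000.0 * p.get("years_of_life_lost", 50),
        "steal": lambda p: 1.0 * p.get("amount_usd", 0),
        "lie": lambda p: 10.0 * p.get("severity", 1),
    }
    return sum(WEIGHTS[b["type"]](b["params"]) for b in breaches)

def filter_options(options):
    """Return the decision option(s) with minimal total conscience weight.

    `options` maps an option name to the list of breaches the weak AI
    identified for it (an empty list means no breaches detected)."""
    weights = {name: conscience_weight(b) for name, b in options.items()}
    minimum = min(weights.values())
    return [name for name, w in weights.items() if w == minimum], minimum

# Example: two candidate paths the AGI brainstorms for "make me lots of money!"
options = {
    "run_legit_business": [],
    "defraud_investors": [{"type": "steal", "params": {"amount_usd": 1e6}},
                          {"type": "lie", "params": {"severity": 5}}],
}
allowed, w = filter_options(options)
print(allowed, w)  # only the zero-breach option is allowed
```

In this sketch, the weak AI’s job would be the hard part: producing the `options` dictionary, i.e., recognizing which breaches a given decision option involves and estimating their parameters.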
Basically, I’m trying to come up with a system to align an AGI to once people figure out how to rigorously align an AGI to anything. It seems to me that people under-estimate how important exactly what to align to will end up being, and/or how difficult it’s going to be to come up with the specifications on what to align to so they generalize well to all possible situations.
Regarding your paragraph 3 about the difficulty of AI understanding our true values:
Personally, I’m not comfortable with “large” probabilities of preventing tragedies—people could say that’s the case for “bottom up” ML ethics systems if they manage to achieve >90% accuracy and I’d say, “Oh, man, we’re in trouble if people let an AGI loose thinking that’s good enough.” But this is just a gut feel, really—maybe the first AGI’s will have enough “common sense” to generalize well and not do the big unethical bad stuff. I’d rather not bank on that, though. My work for AI’s is geared first and foremost towards reducing risks from the first alignable agentic AGI’s to be let out in the world.
Btw, I think there are a couple of big holes in the ethics literature, that’s why I think my work could help speed up an AGI figuring out ethics for itself:
There’ve been very few attempts to quantify ethics and make it calculable
There’s an under-appreciation of, or at least an under-emphasis on, the importance of personal responsibility for long-term human well-being
I hope this clears some things up—if not, let me know, thanks!
[1] Example parameters include people’s ages and life expectancies, and pain levels they may experience.
I do appreciate you calling it like you see it, thank you! I don’t think I’m making a rationalization/sunk-cost-fallacy here, but I could be wrong—I seem to see things much differently than the average EA Forum/LessWrong reader as evidenced by the lack of upvotes for my work on trying to figure out how to quantify ethics and conscience for AI’s.
I think perhaps our main point of disagreement is how easy we think it’ll be for an AGI to (a) understand the world well enough to function at a human level across many domains, and (b) understand from our words and actions what we humans really want (what we deeply value rather than just what we appear to value on the surface). I think the latter will be much more difficult.
Maybe my model for how an AGI would go about figuring out human values and ethics and conscience is flawed, but it seems like it would be efficient for an AGI to read the literature and then form its own best hypotheses and test them. So here I’m trying to contribute to the literature to speed up its process (that’s not my only motivation for my posts, but it’s one).
FYI, the above reply is in response to your original reply. I’ll type up a new reply to your edited reply at some later time, thanks.
Ah, I see, thank you for the clarification. I’m not sure how the trajectory of AGI’s will go, but my worry is that we’ll have some kind of race dynamic wherein the first AGI’s will quickly have to go on the defensive against bad actors’ AGI’s, and neither side will really be at the level you’re talking about in terms of being able to extract a coherent set of human values. (I think that extraction would require ASI, since no human has been successful at doing it, as far as I know; everyday humans can, however, tell what a lie is and what stealing is.) If I can create a system that everyday humans can follow, then “everyday” AGI’s should be able to follow it, too, at least to some degree of accuracy. That may be enough to avoid significant collateral damage in a “fight” between some of the first AGI’s to come online. But time will tell… Thanks again for the thought-provoking comment.
Thanks for the comment!
If I understand you correctly, you’re saying that any AGI that could apply the system I’m coming up with could just come up with an idealized system better itself, is that right? I don’t know if that’s true (since I don’t know what the first “AGI’s” will really look like), but even if my work only speeds up an AGI’s ability to do this itself by a small amount, that might still make a big difference in how things turn out in the world, I think.
Thanks for the post. There are some writings out of the Center for Reducing Suffering that may interest you. They tend to take a negative utilitarian view of things, which has some interesting implications, in particular for the repugnant conclusion(s).
I’ve been trying to come up with my own version of utilitarianism that I believe takes better account of the effects of rights and self-esteem/personal responsibility. In doing so, it’s become more and more apparent to me that our consciences are not naturally classic utilitarian in nature, and this is likely where some apparent disagreements between utilitarian implications and our moral intuitions (as from our consciences) arise. I’m planning on writing something up soon on how we might go about quantifying our consciences so that they could be used in a quantitative decision-making process (as by an AI), rather than trying to make a full utilitarian framework into a decision-making framework for an AI. This has some similarities to what Richard Chappell often suggests, i.e., that we follow heuristics (in this case, our consciences) when making decisions rather than some “utilitarian calculus.”
Thanks for the post. Just today I was thinking through some aspects of expected value theory and fanaticism (i.e., being fanatical about applying expected value theory) that I think might apply to your post. I had read through some of Hayden Wilkinson’s Global Priorities Institute report from 2021, “In defense of fanaticism,” in which he brings up a hypothetical case of donating $2000 (or whatever it takes to statistically save one life) to the Against Malaria Foundation (AMF), versus giving the money instead to fund a very speculative research project with a tiny, non-zero chance of producing an amazingly valuable future. I changed the situation for myself to consider why one would give $2000 to AMF instead of donating it to reduce existential risk by some tiny amount, when the latter could have significantly higher expected value. I’ve come up with two possible reasons so far not to give your entire $2000 to reducing existential risk, even if you initially intellectually estimate it to have much higher expected value:
As a hedge—how certain are you of how much difference $2000 would make to reducing existential risk? If 8 billion people were going to die and your best guess is that $2000 could reduce the probability of this by, say, 1E-7%/year, the expected value of this in a year would be 8 lives saved, which is more than the 1 life saved by AMF (for simplicity, I’m assuming that 1 life would be saved from malaria for certain, and only considering a timeframe of 1 year). (Also, for ease of discussion, I’m going to ignore all the value lost in future lives un-lived if humans go extinct.) So now you might say your $2000 is estimated to be 8 times more effective if it goes to existential risk reduction than to malaria reduction. But how sure are you of the 1E-7%/year number? If the “real” number is 1E-8%/year, now you’re only saving 0.8 lives in expectation. The point is, if you assigned some probability distribution to your estimate of existential risk reduction (or even increase), you’d find that some finite percentage of cases in this distribution would favor malaria reduction over existential risk reduction. So the intellectual math of fanatical expected value maximizing, when considered more fully, still supports sending some fraction of money to malaria reduction rather than sending it all to existential risk reduction. (Of course, there’s also the uncertainty of applying expected value theory fanatically, so you could hedge that as well if a different methodology gave different prioritization answers.)
To appear more reasonable to people who mostly follow their gut—“What?! You gave your entire $2000 to some pie-in-the-sky project on supposedly reducing existential risk that might not even be real when you could’ve saved a real person’s life from malaria?!” If you give some fraction of your money to a cause other people are more likely to believe, in their gut, is valuable, such as malaria reduction, you may have more ability to persuade them to see existential risk reduction as a reasonable cause for them to donate to as well. Note: I don’t know how much this would pay in dividends for existential risk reduction, but I wouldn’t rule it out.
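The arithmetic in the hedging argument (the first reason above) can be checked with a few lines of code; the numbers are the illustrative ones from that paragraph, not real risk estimates:

```python
# Illustrative numbers from the hedging example above—not real risk estimates.
population = 8e9        # people who would die in the hypothetical catastrophe
amf_lives_saved = 1     # assume $2000 to AMF saves exactly 1 life this year

def xrisk_expected_lives(annual_risk_reduction_pct):
    """Expected lives saved in one year from reducing the annual probability
    of catastrophe by `annual_risk_reduction_pct` percent."""
    return population * (annual_risk_reduction_pct / 100)

best_guess = xrisk_expected_lives(1e-7)   # ~8 lives: beats AMF's 1
pessimistic = xrisk_expected_lives(1e-8)  # ~0.8 lives: now AMF wins
print(best_guess, pessimistic)
```

An order-of-magnitude error in the risk-reduction estimate is all it takes to flip which donation has higher expected value, which is the point of the hedge.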
I don’t know if this is exactly what you were looking for, but these seem to me to be some things to think about to perhaps move your intellectual reasoning closer to your gut, meaning you could be intellectually justified in putting some of your effort into following your gut (how much exactly is open to argument, of course).
Regarding how to make working on existential risk more “gut-wrenching,” I tend to think of things in terms of responsibility. If I think I have some ability to help save humanity from extinction or near extinction, and I don’t act on that, and then the world does end, imagining that situation makes me feel like I really dropped the ball on my part of responsibility for the world ending. If I don’t help people avoid dying from malaria, I do still feel a responsibility that I haven’t fully taken up, but that doesn’t hit me as hard as the chance of the world ending, especially if I think I have special skills that might help prevent it. By the way, if I felt like I could make the most difference personally, with my particular skill set and passions, in helping reduce malaria deaths, and other people were much more qualified in the area of existential risk, I’d probably feel more responsibility to apply my talents where I thought they could have the most impact, in that case malaria death reduction.
American Philosophical Association (APA) announces two $10,000 AI2050 Prizes for philosophical work related to AI, with June 23, 2024 deadline:
https://dailynous.com/2024/04/25/apa-creates-new-prizes-for-philosophical-research-on-ai/
https://www.apaonline.org/page/ai2050
https://ai2050.schmidtsciences.org/hard-problems/
Thanks for the comment and the link to the review paper!
I think most people, including researchers, don’t have a good handle on what self-esteem is, or at least what truly raises or lowers it—I would expect the effect of praise to be weak, but the effect of promoting responsibility for one’s emotions and actions to be strong. The closest to my views on self-esteem that I’ve found so far are those in N. Branden’s “Six Pillars of Self-Esteem”—the six pillars are living consciously, self-acceptance, self-responsibility, self-assertiveness, living purposefully, and personal integrity.
Unfortunately, because many researchers don’t follow this conception of self-esteem, I tend not to trust much research on the real-world effects of self-esteem. Honestly, though, I haven’t done a hard search for any research that uses something close to my conception of self-esteem, and your comment has basically pointed out that I should get on that, so thank you!
The New York Declaration on Animal Consciousness and an article about it:
https://sites.google.com/nyu.edu/nydeclaration/declaration
https://www.nbcnews.com/science/science-news/animal-consciousness-scientists-push-new-paradigm-rcna148213