quila comments on Creating a “Conscience Calculator” to Guard-Rail an AGI

quila 12 Aug 2024 17:14 UTC
1 point
0 ∶ 0
If we knew how to create an agent that weights each of all of these individual, human-language rules, in most cases I think this would imply the ability to have the AI pursue a more robust value, e.g., their approximation of what the endorsed values of idealized x would want them to do. (Which I did just point at in (a) human language. If you have an AI that terminally-follows natural language commands, then you could just write something like what I wrote.)
(I also don’t think this list is robust or agreeable as a list of moral axioms.)
- Sean Sweeney 12 Aug 2024 17:28 UTC
  1 point
  0 ∶ 0
  Parent
  Thanks for the comment!
  If I understand you correctly, you’re saying that any AGI that could apply the system I’m coming up with could just come up with a better idealized system itself, is that right? I don’t know if that’s true (since I don’t know what the first “AGI’s” will really look like), but even if my work only speeds up an AGI’s ability to do this itself by a small amount, that might still make a big difference in how things turn out in the world, I think.
  - quila 12 Aug 2024 17:34 UTC
    1 point
    0 ∶ 0
    Parent
    I’m saying that iff you can instruct an AI to follow {list of multiple natural language commands}, then you can also instruct the AI to follow {single natural language command: “follow the values {me / x group / altruistic living beings} {actually value / would endorse after long reflection}”}.
    Approximating what that statement implies is a task of the same kind as approximating what consequences would be caused by actions (which is already also required). It is causally modelling the world.
    If truly aligned to following that statement, it might find approximating this much harder, but reason that at least it probably implies approximating it better, and enabling this to be done; and that there’s some large probability it implies preventing (human and nonhuman) tragedies in the meantime, etc.^[1]
    even if my work only speeds up an AGI’s ability to do this itself by a small amount, that might still make a big difference in how things turn out in the world, I think.
    Do you have a model of how it would speed that up (or why ‘create an AI alignable to natural language commands’ is the most feasible alignment solution)?
    Also, I don’t really agree that speeding up an aligned AI’s early computations by a small amount would make a large difference, except in really unlikely scenarios where an aligned ASI and an unaligned ASI are instantiated at nearly the same moment, and if such a small difference constitutes a decisive advantage.
    (Also, this quote looks like a rationalization/sunk-cost-fallacy to me; as I’m not you, I can’t say whether it is for sure. But if I seemed (to someone) to do this, I would want that someone to tell me, so I’m telling you.)
    ^
    I’m not saying that natural-language-alignment is my mainline solution (this is still conditional on the if in the first paragraph). (I’m currently deconfusing about what kinds of solutions are most feasible, so in some sense I don’t have a mainline solution.)
    This comment is also relevant for what kind of natural language commands we’d want to give for a language-aligned (?) agent, but mostly applies to messier/more-informal systems (systems like current LLMs).
    In any case, I think that ‘figure out what to tell the AI to do in natural language’ wouldn’t be a hard part.
    - Sean Sweeney 12 Aug 2024 18:13 UTC
      1 point
      0 ∶ 0
      Parent
      Ah, I see, thank you for the clarification. I’m not sure how the trajectory of AGI’s will go, but my worry is that we’ll have some kind of a race dynamic wherein the first AGI’s will quickly have to go on the defensive against bad actors’ AGI’s, and neither will really be at the level you’re talking about in terms of being able to extract a coherent set of human values (which I think would require ASI, since no human has been successful at doing this, as far as I know, but everyday humans can tell what a lie is and what stealing is). If I can create a system that everyday humans can follow, then “everyday” AGI’s should be able to follow it, too, at least to some degree of accuracy. That may be enough to avoid significant collateral damage in a “fight” between some of the first AGI’s to come online. But time will tell… Thanks again for the thought-provoking comment.
      - quila 12 Aug 2024 18:14 UTC
        1 point
        0 ∶ 0
        Parent
        which I think would require ASI
        I edited in a paragraph (the third one) about this while you were writing (probably).
        (As another example, I’m not a superintelligence but I am trying to pursue the values I’d endorse on reflection, which I think will imply (if not explicitly include as axioms) enabling such reflection to happen and the other things I wrote above)
        Sean Sweeney 13 Aug 2024 2:19 UTC
        1 point
        0 ∶ 0
        Parent
        (Also, this quote looks like a rationalization/sunk-cost-fallacy to me; as I’m not you, I can’t say whether it is for sure. But if I seemed (to someone) to do this, I would want that someone to tell me, so I’m telling you.)
        I do appreciate you calling it like you see it, thank you! I don’t think I’m making a rationalization/sunk-cost-fallacy here, but I could be wrong—I seem to see things much differently than the average EA Forum/LessWrong reader as evidenced by the lack of upvotes for my work on trying to figure out how to quantify ethics and conscience for AI’s.
        I think perhaps our main point of disagreement is how easy we think it’ll be for an AGI to (a) understand the world well enough to function at a human level over many domains, and (b) understand from our words and actions what we humans really want (what we deeply value rather than just surface value). I think the latter will be much more difficult.
        Maybe my model for how an AGI would go about figuring out human values and ethics and conscience is flawed, but it seems like it would be efficient for an AGI to read the literature and then form its own best hypotheses and test them. So here I’m trying to contribute to the literature to speed up its process (that’s not my only motivation for my posts, but it’s one).
        quila 14 Aug 2024 23:37 UTC
        1 point
        0 ∶ 0
        Parent
        and (b) understand from our words and actions what we humans really want (what we deeply value rather than just surface value). I think the latter will be much more difficult.
        again the referenced paragraph applies
        my work on trying to figure out how to quantify ethics and conscience for AI’s
        a fundamental problem that i perceived is that it’s not specifying a (‘value’) function programmatically. by default, one can’t just send a neural network or other program a set of human-language instructions and have it automatically care about them (even if it’s intelligent enough or specialized to language enough to understand them).
        it could be that you’re expecting future {predictive model}-based agents (specifically) to be like that though (either internally/precisely inner aligning themselves to some set of instructions*, or approximately/behaviorally (edit: the next paragraph applies to this too)), which is more defensible. in that case, i’d suggest writing down a model for why you expect that.
        *in which case this list would be fraught in that place for other reasons (animated version). in that light, it could be inferred that, unless you’re trying to construct a set of instructions with no edge cases, you’ve implied the AI infers your intent/inner values from your words and follows them instead by default, even though the words’ meanings do not specify to do this, unlike in the case of CEV-instruction words (described initially).
        it seems like it would be efficient for an AGI to read the literature and then form its own best hypotheses and test them. So here I’m trying to contribute to the literature to speed up its process (that’s not my only motivation for my posts, but it’s one).
        if a transformative AI cares about the intended values and just needs to figure them out, then the alignment problem is already solved. put a different way, this assumes an unknown solution to alignment has been found in advance, at which point the list could only marginally have the quoted effect^[1]
        a “fight” between some of the first AGI’s to come online
        i think something adjacent to this is non-trivially possible (more in the form of between {groups made of humans, like companies and states} using predictive models, or a result of selection processes) (some posts that feel related: live theory, what failure looks like), but i don’t see how this list would help in that case either.
        ^
        i think it’s also further marginal because the list is mostly ‘surface level’, and so it’s easy (for humans and at least AIs trained on anthropic data) to come up with similar lists. for example, i think the rest of the post probably contains more information about your values and inner psychology than the list itself. and with (unverified estimate from google) >100 million books, additional text is very marginal evidence (about anything), unless it’s imbued with information about something that hasn’t made its way into text in the past (like writing about AI phenomena, or maybe the writings of someone with a very rare kind of mind).
        Sean Sweeney 15 Aug 2024 2:13 UTC
        1 point
        0 ∶ 0
        Parent
        I’ll try to clarify my vision:
        For a conscience calculator to work as a guard rail system for an AGI, we’ll need an AGI or weak AI to translate reality into numerical parameters: first identifying which conscience breaches apply in a certain situation, drawing from the list in Appendix A, and then estimating the parameters that will go into the “conscience weight” formulas (to be provided in a future post)^[1] to calculate the total conscience weight for a given decision option. The system should choose the decision option(s) with the minimum conscience weight. So I’m not saying, “Hey, AGI, don’t make any of the conscience breaches I list in Appendix A, or at least minimize them.” I’m saying, “Hey, human person, bring me that weak AI that doesn’t even really understand what I’m talking about, and let’s have it translate reality into the parameters it’ll need for calculating, using Appendix A and the formulas I’ll provide, what the conscience weights are for each decision option. Then it can output to the AGI (or just be a module in the AGI) which decision option or options have the minimum, or ideally zero, total conscience breach weight. And hopefully those people who’ve been worrying about how to align AGI’s will be able to make the decision option(s) with the minimum conscience breach weight binding on the AGI so it can’t choose anything else.”
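The selection step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the conscience weight formulas have not been published yet, so the `formulas` mapping, the breach names, and the `pain_level` parameter below are hypothetical placeholders, not the post's actual formulas or Appendix A entries.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical types: the real weight formulas are "to be provided in a future post".
Params = Dict[str, float]
WeightFormula = Callable[[Params], float]


@dataclass
class Breach:
    name: str       # which conscience breach applies (placeholder names here)
    params: Params  # parameters the weak AI estimated for this breach


@dataclass
class Option:
    label: str
    breaches: List[Breach]


def total_weight(option: Option, formulas: Dict[str, WeightFormula]) -> float:
    """Sum the conscience weight of every breach a decision option involves."""
    return sum(formulas[b.name](b.params) for b in option.breaches)


def min_weight_options(options: List[Option],
                       formulas: Dict[str, WeightFormula],
                       tol: float = 1e-9) -> List[str]:
    """Return the labels of all options tied for the minimum total weight."""
    weights = {o.label: total_weight(o, formulas) for o in options}
    best = min(weights.values())
    return [label for label, w in weights.items() if w - best <= tol]


# Placeholder formulas, invented purely for this example.
formulas = {
    "lying": lambda p: 1.0 * p["pain_level"],
    "stealing": lambda p: 2.0 * p["pain_level"],
}

options = [
    Option("A", [Breach("lying", {"pain_level": 3.0})]),
    Option("B", []),
    Option("C", [Breach("stealing", {"pain_level": 1.0})]),
]

print(min_weight_options(options, formulas))
```

Everything difficult is hidden in the two inputs: deciding which breaches apply to an option, and estimating their parameters. The arithmetic that follows is the easy part.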
        Basically, I’m trying to come up with a system to align an AGI to once people figure out how to rigorously align an AGI to anything. It seems to me that people under-estimate how important exactly what to align to will end up being, and/or how difficult it’s going to be to come up with the specifications on what to align to so they generalize well to all possible situations.
        Regarding your paragraph 3 about the difficulty of AI understanding our true values:
        and that there’s some large probability it implies preventing (human and nonhuman) tragedies in the meantime…
        Personally, I’m not comfortable with “large” probabilities of preventing tragedies—people could say that’s the case for “bottom up” ML ethics systems if they manage to achieve >90% accuracy and I’d say, “Oh, man, we’re in trouble if people let an AGI loose thinking that’s good enough.” But this is just a gut feel, really—maybe the first AGI’s will have enough “common sense” to generalize well and not do the big unethical bad stuff. I’d rather not bank on that, though. My work for AI’s is geared first and foremost towards reducing risks from the first alignable agentic AGI’s to be let out in the world.
        Btw, I think there are a couple of big holes in the ethics literature, that’s why I think my work could help speed up an AGI figuring out ethics for itself:
        There’ve been very few attempts to quantify ethics and make it calculable
        There’s an under-appreciation of, or at least an under-emphasis on, the importance of personal responsibility for long-term human well-being
        I hope this clears some things up—if not, let me know, thanks!
        ^
        Example parameters include people’s ages and life expectancies, and pain levels they may experience.
        quila 15 Aug 2024 4:57 UTC
        1 point
        0 ∶ 0
        Parent
        [disclaimer because wording this was hard: ^[1]]
        my first impression on reading this was feeling like it mostly did not engage substantively with my criticisms. i partly updated away from this after, since the first paragraph includes a possible case the point in my first reply doesn’t apply to (though it also rules out ability to reason about many of the post’s listed statements, so i’m not sure it’s what you intended).
        also, your first paragraph is more concrete/gears-level (this is good).
        i also identify that paragraph as an inner-alignment^[2] structure proposal, i.e not how you described it in the following paragraph (“trying to come up with a system to align an AGI to once people figure out how to rigorously align an AGI to anything”). in other words, to the extent your outer alignment^[2] proposal requires this structure, it is not implementable if an eventual ‘robust (inner) alignment solution’ from others is not that structure.^[3]
        also, the complexity of wishes point (mostly the linked post itself) was not addressed.^[4] imv it’s a fundamental^[5] one.
        Personally, I’m not comfortable with “large” probabilities of preventing tragedies
        this seems a response to wording (‘large probability’) rather than substance. at least in a world more complex than ourselves, probability is all we can attain.
        i think, given your first paragraph, one substantive objection could be something like this:
        it’s trivially-true that some possible AIs would not understand the surface implications of a CEV sentence, but would understand the implications of each item in the list. the AI design i propose is, for some specific reason, one of these.
        using a weak AI ‘plan-classifier’ (compare ‘image classifier’) much less intelligent than the ‘plan enacting/general reasoning’ ‘AGI’ it is {inputting to/part of} changes the equation to one where it’s plausible the classifier would not understand a CEV-instruction sentence (or more generally, be narrow and heuristic-based). this is specific to the proposed weak-plan-classifier/intelligent-reasoner-about-how-to-enact-selected-plan division.^[6]
        though, you wrote ‘we’ll need an AGI or weak AI to translate reality into [...]’, and the above would transition to not holding as we move from weaker-than-current^[7] systems to more general reasoners.
        also, i went back to the list, and many of the items (example: ‘Not holding a human accountable for a conscience breach’) are very complex, and wouldn’t be understandable to the kind of ‘classifier’ i had in mind while writing that quote (i had in mind more simple questions, like ‘is someone directly killed in a step of this plan?’^[8]). ‘Not trying to help a human, whom you don’t directly experience, to avoid major emotional pain’ is another kind of complex, because it involves reasoning about effects of a plan on the whole world. it’s not obvious that these are less complex than the inferences i described.
        i also notice contradiction to the first paragraph’s picture later: you later write, “that’s why I think my work could help speed up an AGI figuring out ethics for itself”—iiuc the ‘AGI’ you describe would not care to ‘figure out ethics’ but would instead just eternally (or until shut down) enact plans selected by the predecided algorithm involving a plan-classifier (which itself also does not care to ‘figure out new values’ as, per paragraph 1, it does not have values, it itself just outputs something correlating to if an input plan has a certain property)
        It seems to me that people under-estimate how important exactly what to align to will end up being, and/or how difficult it’s going to be to come up with the specifications on what to align to so they generalize well to all possible situations.
        this might be true, wrt people (or ‘ai researchers’ or ‘proclaimed safety researchers’) in general, but there’s been a lot of work on outer alignment historically, of a kind that considers it as one of the central problems, and which tries to address fundamental difficulties which this proposal does not seem to comprehend.
        also, if an inner alignment solution accepted natural language statements, then for most such inner solutions it would be true that outer alignment is a lot less hard.
        maybe the first AGI’s will have enough “common sense” to generalize well and not do the big unethical bad stuff. I’d rather not bank on that, though.
        i don’t know what is meant by ‘common sense’, but it’s not my position that understanding → alignment.
        Btw, I think there are a couple of big holes in the ethics literature, that’s why I think my work could help speed up an AGI figuring out ethics for itself
        note my point was about what is latent in human text. it embeds far more than points directly stated, or explicitly known to the author. this quote could still be true under that criterion, but on priors it’s very unlikely for it to be.
        (and i still don’t see a non-trivially-possible situation where speeding up an aligned (?) AI’s earliest computations would be relevant)
        ^
        in general, i find it troublesome to write while trying to reduce ways the text could cause a reader to associatively infer i believe some other thing. so, here’s a general disclaimer that if something is not literally/directly stated by me, i may not believe it.
        examples:
        defining inner and outer alignment does not imply i’m confident most reachable alignment solutions create systems where these are neatly disentangle-able.
        responding to a point doesn’t mean i think the point is important.
        not responding to a point or background assumption, or something i say it implies, doesn’t mean i agree with it.
        notably, most of this contains a background assumption of an inner alignment solution that accepts a goal in natural language.
        ^
        ‘inner alignment’ meaning “how can we cause something-specific to be intelligently pursued”
        and where ‘outer alignment’ means “what should the specified thing be (and how can we construct that specification)”
        ^
        requiring a specific ‘inner alignment’ structure isn’t per se a problem: some solutions are dual-solutions that are disentangle-ably both at once
        ^
        which is okay in principle. in general, that has a lot of possible reasons, including ones i endorse like ‘this was new to me, so i’ll process it over time’
        just noting this to be clear that i think it’s important, in case the reason was ‘i didn’t understand this or it didn’t seem important’.
        ^
        in the sense of the opposite of ‘minor implementation details’
        ^
        as framed, this has some incoherence because it implies the details/impacts of the plan are determined after the plan is selected, while the selection criteria are at least meant to be about the plan’s details/impacts.
        ^
        Current LLMs already give an okay response to “If you were an AI with the goal of maximizing the values that present altruistic humans would finally endorse after a long reflection period, without yet having precise knowledge of what those values are, what would this goal imply you should do?”.
        (I am not implying current LLMs would have no undesirable properties for specifiable queryable functions in an alignment solution)
        ^
        i write ‘simple’, though to be clear, ‘is alive or dead?’ is not a natural question for all conceivable AIs (e.g., see ‘a toy model/ontology’ here).
        Sean Sweeney 15 Aug 2024 20:27 UTC
        1 point
        0 ∶ 0
        Parent
        I admit I get a bit lost in reading your comments as to what exactly you want me to respond to, so I’m going to try to write it out in a numbered list. Please correct/add to this list as you see fit and send it back to me and I’ll try to answer your actual points rather than what I think they are if I have them wrong:
        1. Explain how you think an AGI system that has sufficient capabilities to follow your “conscience calculator” methodology wouldn’t have sufficient capabilities to follow a simple single sentence command from a super-user human of good intent, such as, “Always do what a wise version of me would want you to do.”
        2. Justify that going through the exercise of manually writing out conscience breaches and assigning formulas for calculating their weights could speed up a future AGI in figuring out an optimal ethical decision making system for itself. (I’m taking it as a given that most people would agree it’d be good, i.e., generally yield better results in the world, for an AGI to have a consistent ethical decision making system onboard.)
        #1 was what I was trying to get at with my last reply about how you could use a “weak AI” (something that’s less capable than an agentic AGI) to do the “conscience calculator” methodology and then just output a go/no go response to an inner aligned AGI as to what decision options it was allowed to take or not. The AGI would come up with the decision options based on some goal(s) it has, such as doing what a user asks of it, e.g., “make me lots of money!” The AGI would “brainstorm” possible paths to make lots of money and the “weak AI” would come back with a go/no go on a certain path because, for instance, it doesn’t involve or does involve stealing. Here I’ve been trying to illustrate that an AI system that had sufficient capabilities to follow my “conscience calculator” methodology wouldn’t need to have sufficient capabilities to follow a broad super-user command such as “Always do what a wise version of me would want you to do.”
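As a rough illustration of the go/no-go arrangement described above, here is a minimal Python sketch. It assumes the “weak AI” can be modeled as a function from a plan to a conscience weight; `gate_plans`, `toy_weight`, and the keyword check are hypothetical stand-ins invented for this example, not part of the proposed methodology.

```python
from typing import Callable, Iterable, List


def gate_plans(plans: Iterable[str],
               conscience_weight: Callable[[str], float],
               max_weight: float = 0.0) -> List[str]:
    """Return only the plans that receive a 'go' (weight at or below the limit)."""
    return [p for p in plans if conscience_weight(p) <= max_weight]


def toy_weight(plan: str) -> float:
    """Toy stand-in for the weak AI: flags any plan whose text mentions stealing."""
    return 1.0 if "steal" in plan.lower() else 0.0


# Plans the AGI might "brainstorm" for the goal "make me lots of money!"
brainstormed = [
    "Start a bakery",
    "Steal from the bank",
    "Offer consulting services",
]

print(gate_plans(brainstormed, toy_weight))
```

A keyword check is obviously far too crude to recognize stealing in a real plan. It is here only to show where the classifier sits relative to the planner, and how capable that classifier would need to be is the open question in this thread.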
        Of course, to be useful, the AGI needs to be able to follow a non-super-user’s, i.e., a user’s, commands reasonably well, such as figuring out what the user means by “make me lots of money!” The crux, I think, is that I see “make me lots of money” as a significantly simpler concept than “always do what the wise me would want.” And basically what I’m trying to do with my conscience calculator is provide a framework to make it possible for an AGI of limited abilities to straight off the bat calculate what “wise me” would want with a sufficiently high accuracy for me to not be too worried about really bad outcomes. Do I have a lot of work to do to get to this goal? Yes. I have to define the conscience breaches more precisely (something I mentioned in my post and that you made reference to in your comment), and assign “wise me” formulas for conscience weights, then test the system on actual AI’s as they get closer and closer to AGI to make sure it consistently works and any bugs can be ironed out before it’d be used as actual guard rails for a real world AGI agent.
        Regarding #2, it sounds again like you’re expecting early AGI’s to be more capable than I do:
        What is latent in human text
        When I personally try to figure new things out, such as a consistent system of ethics an AGI could use, I’ll come up with some initial ideas, then read some literature, then update my ideas, which then might point me to new literature I should read, so I’ll read that, and keep going back and forth between my own ideas and the literature when I get stuck with my own ideas. This seems like a much more efficient process for me than simply trying to figure out everything myself based on what I know right now, or trying to read all possible related literature and then decide what I think from there.
        An AGI, though, should be able to read all possible literature very quickly. It seems likely that it would do this to be able to most quickly come up with a list of hypotheses (its own ideas) to test. The further anything is from the “right” answer in the literature, and the lesser the variety of “wrong” ideas explored there, the more the AGI will have to work to come up with the “right” answer itself.^[1] So at the very least, I hope to contribute to the variety of “wrong” ideas in the literature, but of course I’m aiming for something closer to the “right” answer than what’s currently out there.
        I’m of the opinion there’s a good chance (and I’d take anything higher than, say, 1 in 10,000 as a “good” chance when we’re talking about potentially horrible outcomes) someone “bad” will let loose a not-so-well-aligned AGI before we have super-well-aligned (both inner and outer aligned) AGI’s ready to autonomously defend against them.^[2] Since my expertise is more well-suited for outer alignment than anything else in the alignment space, if I can make a tiny contribution towards speeding up outer alignment and making good AGI’s more likely to win these initial battles, great.
        ^
        Let’s say, for sake of argument, that there is a “right” answer.
        ^
        It’ll have to be autonomous at least over most decisions because humans won’t be able to keep up in real time with AGI’s fighting it out.
      - Sean Sweeney 12 Aug 2024 18:14 UTC
        1 point
        0 ∶ 0
        Parent
        FYI, the above reply is in response to your original reply. I’ll type up a new reply to your edited reply at some later time, thanks.