my first impression on reading this was that it mostly did not engage substantively with my criticisms. i partly updated away from this afterwards, since the first paragraph includes a possible case which the point in my first reply doesn’t apply to (though that case also rules out the ability to reason about many of the post’s listed statements, so i’m not sure it’s what you intended).
also, your first paragraph is more concrete/gears-level (this is good).
i also identify that paragraph as an inner-alignment[2] structure proposal, i.e. not how you described it in the following paragraph (“trying to come up with a system to align an AGI to once people figure out how to rigorously align an AGI to anything”). in other words, to the extent your outer alignment[2] proposal requires this structure, it is not implementable if an eventual ‘robust (inner) alignment solution’ from others is not that structure.[3]
also, the complexity of wishes point (mostly the linked post itself) was not addressed.[4] imv it’s a fundamental[5] one.
Personally, I’m not comfortable with “large” probabilities of preventing tragedies

this seems a response to wording (‘large probability’) rather than substance. at least in a world more complex than ourselves, probability is all we can attain.
i think, given your first paragraph, one substantive objection could be something like this:
it’s trivially-true that some possible AIs would not understand the surface implications of a CEV sentence, but would understand the implications of each item in the list. the AI design i propose is, for some specific reason, one of these.
using a weak AI ‘plan-classifier’ (compare ‘image classifier’) much less intelligent than the ‘plan enacting/general reasoning’ ‘AGI’ it is {inputting to/part of} changes the equation to one where it’s plausible the classifier would not understand a CEV-instruction sentence (or more generally, would be narrow and heuristic-based). this is specific to the proposed weak-plan-classifier/intelligent-reasoner-about-how-to-enact-selected-plan division.[6]
though, you wrote ‘we’ll need an AGI or weak AI to translate reality into [...]’, and the above would cease to hold as we move from weaker-than-current[7] systems to more general reasoners.
also, i went back to the list, and many of the items (example: ‘Not holding a human accountable for a conscience breach’) are very complex, and wouldn’t be understandable to the kind of ‘classifier’ i had in mind while writing that quote (i had in mind more simple questions, like ‘is someone directly killed in a step of this plan?’[8]). ‘Not trying to help a human, whom you don’t directly experience, to avoid major emotional pain’ is another kind of complex, because it involves reasoning about effects of a plan on the whole world. it’s not obvious that these are less complex than the inferences i described.
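the asymmetry between these two kinds of check can be made concrete with a toy sketch (purely illustrative; every name and the example plan below are hypothetical, not from the post): a narrow heuristic classifier can answer a local, per-step question, but a whole-world property like the ‘major emotional pain’ item has no comparable heuristic implementation.

```python
# Toy contrast between a local, per-step check that a narrow heuristic
# 'plan-classifier' could plausibly run, and a whole-world check that it
# could not. All function names and the example plan are hypothetical.

def directly_kills_someone(plan_steps: list[str]) -> bool:
    """Local check: scan each step's text for an explicitly named killing."""
    return any("kill" in step.lower() for step in plan_steps)

def avoids_major_emotional_pain_for_all(plan_steps: list[str]) -> bool:
    """Whole-world check: would require modeling the plan's effects on
    every human, not just inspecting the steps' surface text."""
    raise NotImplementedError("needs general world-modeling, not a heuristic")

plan = ["buy shares", "spread a false rumor", "sell shares"]
print(directly_kills_someone(plan))  # → False: no step names a killing
```

the first function is answerable from the plan’s surface text alone; the second is deliberately left unimplemented, which is the point of the contrast.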
i also notice a later contradiction of the first paragraph’s picture: you write, “that’s why I think my work could help speed up an AGI figuring out ethics for itself”—iiuc, the ‘AGI’ you describe would not care to ‘figure out ethics’ but would instead just eternally (or until shut down) enact plans selected by the predecided algorithm involving a plan-classifier (which itself also does not care to ‘figure out new values’: per paragraph 1, it does not have values; it just outputs something correlated with whether an input plan has a certain property).
It seems to me that people under-estimate how important exactly what to align to will end up being, and/or how difficult it’s going to be to come up with the specifications on what to align to so they generalize well to all possible situations.
this might be true, wrt people (or ‘ai researchers’ or ‘proclaimed safety researchers’) in general, but there’s been a lot of work on outer alignment historically, of a kind that considers it as one of the central problems, and which tries to address fundamental difficulties which this proposal does not seem to comprehend.
also, if an inner alignment solution accepted natural language statements, then for most such inner solutions it would be true that outer alignment is a lot less hard.
maybe the first AGI’s will have enough “common sense” to generalize well and not do the big unethical bad stuff. I’d rather not bank on that, though.
i don’t know what is meant by ‘common sense’, but it’s not my position that understanding → alignment.
Btw, I think there are a couple of big holes in the ethics literature, that’s why I think my work could help speed up an AGI figuring out ethics for itself
note my point was about what is latent in human text: it embeds far more than the points directly stated, or explicitly known to the author. this quote could still be true under that criterion, but on priors it’s very unlikely to be.
(and i still don’t see a non-trivially-possible situation where speeding up an aligned (?) AI’s earliest computations would be relevant)
in general, i find it troublesome to write while trying to reduce the ways the text could cause a reader to associatively infer that i believe some other thing. so, here’s a general disclaimer: if something is not literally/directly stated by me, i may not believe it.
examples:
defining inner and outer alignment does not imply i’m confident most reachable alignment solutions create systems where these are neatly disentangle-able.
responding to a point doesn’t mean i think the point is important.
not responding to a point or background assumption, or something i say it implies, doesn’t mean i agree with it.
notably, most of this contains a background assumption of an inner alignment solution that accepts a goal in natural language.
which is okay in principle. in general, that has a lot of possible reasons, including ones i endorse like ‘this was new to me, so i’ll process it over time’
just noting this to be clear that i think it’s important, in case the reason was ‘i didn’t understand this or it didn’t seem important’.
as framed, this has some incoherence because it implies the details/impacts of the plan are determined after the plan is selected, while the selection criteria are at least meant to be about the plan’s details/impacts.
Current LLMs already give an okay response to “If you were an AI with the goal of maximizing the values that present altruistic humans would finally endorse after a long reflection period, without yet having precise knowledge of what those values are, what would this goal imply you should do?”.
(I am not implying current LLMs would have no undesirable properties for specifiable queryable functions in an alignment solution)
I admit I get a bit lost reading your comments as to what exactly you want me to respond to, so I’m going to try to write it out in a numbered list. Please correct/add to this list as you see fit and send it back to me, and I’ll try to answer your actual points, rather than what I think they are, if I have them wrong:
Explain how you think an AGI system that has sufficient capabilities to follow your “conscience calculator” methodology wouldn’t have sufficient capabilities to follow a simple single sentence command from a super-user human of good intent, such as, “Always do what a wise version of me would want you to do.”
Justify that going through the exercise of manually writing out conscience breaches and assigning formulas for calculating their weights could speed up a future AGI in figuring out an optimal ethical decision making system for itself. (I’m taking it as a given that most people would agree it’d be good, i.e., generally yield better results in the world, for an AGI to have a consistent ethical decision making system onboard.)
#1 was what I was trying to get at with my last reply about how you could use a “weak AI” (something that’s less capable than an agentic AGI) to do the “conscience calculator” methodology and then just output a go/no go response to an inner aligned AGI as to what decision options it was allowed to take or not. The AGI would come up with the decision options based on some goal(s) it has, such as doing what a user asks of it, e.g., “make me lots of money!” The AGI would “brainstorm” possible paths to make lots of money and the “weak AI” would come back with a go/no go on a certain path because, for instance, it doesn’t involve or does involve stealing. Here I’ve been trying to illustrate that an AI system that had sufficient capabilities to follow my “conscience calculator” methodology wouldn’t need to have sufficient capabilities to follow a broad super-user command such as “Always do what a wise version of me would want you to do.”
Of course, to be useful, the AGI needs to be able to follow a non-super-user’s, i.e., a user’s, commands reasonably well, such as figuring out what the user means by “make me lots of money!” The crux, I think, is that I see “make me lots of money” as a significantly simpler concept than “always do what the wise me would want.” And basically what I’m trying to do with my conscience calculator is provide a framework to make it possible for an AGI of limited abilities to straight off the bat calculate what “wise me” would want with a sufficiently high accuracy for me to not be too worried about really bad outcomes. Do I have a lot of work to do to get to this goal? Yes. I have to define the conscience breaches more precisely (something I mentioned in my post and that you made reference to in your comment), and assign “wise me” formulas for conscience weights, then test the system on actual AIs as they get closer and closer to AGI to make sure it consistently works and any bugs can be ironed out before it’d be used as actual guard rails for a real world AGI agent.
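The division of labor described in #1 and #2 can be sketched as a control loop. This is a hedged toy sketch only: the breach names, weights, threshold, and keyword “detector” below are all invented stand-ins for illustration, not the actual conscience calculator.

```python
# Toy sketch of the proposed division of labor: the 'AGI' brainstorms
# candidate plans, and a weak 'conscience calculator' returns go/no-go
# per plan. All breach names, weights, and the threshold are hypothetical.

BREACH_WEIGHTS = {"stealing": 10.0, "lying": 3.0}  # higher = worse
GO_THRESHOLD = 5.0  # hypothetical cutoff above which a plan is a "no go"

def detect_breaches(plan: str) -> list[str]:
    """Stand-in for the weak AI's breach classifier: here, a crude
    keyword match; a real classifier would be a trained model."""
    return [b for b in BREACH_WEIGHTS if b in plan.lower()]

def conscience_go(plan: str) -> bool:
    """Go/no-go: sum the weights of detected breaches and compare
    the total against the threshold."""
    score = sum(BREACH_WEIGHTS[b] for b in detect_breaches(plan))
    return score < GO_THRESHOLD

# The 'AGI' brainstorms possible paths to "make me lots of money":
candidate_plans = [
    "start a business selling software",
    "make money by stealing from a bank",
]
allowed = [p for p in candidate_plans if conscience_go(p)]
print(allowed)  # → ['start a business selling software']
```

Only the non-stealing plan passes the gate; the AGI would then pick among the allowed plans according to its own goal.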
Regarding #2, it sounds again like you’re expecting early AGIs to be more capable than I do:
What is latent in human text
When I personally try to figure new things out, such as a consistent system of ethics an AGI could use, I’ll come up with some initial ideas, then read some literature, then update my ideas, which then might point me to new literature I should read, so I’ll read that, and keep going back and forth between my own ideas and the literature when I get stuck with my own ideas. This seems like a much more efficient process for me than simply trying to figure out everything myself based on what I know right now, or of trying to read all possible related literature and then decide what I think from there.
An AGI, though, should be able to read all possible literature very quickly. It seems likely that it would do this to be able to most quickly come up with a list of hypotheses (its own ideas) to test. The further anything is from the “right” answer in the literature, and the lesser the variety of “wrong” ideas explored there, the more the AGI will have to work to come up with the “right” answer itself.[1] So at the very least, I hope to contribute to the variety of “wrong” ideas in the literature, but of course I’m aiming for something closer to the “right” answer than what’s currently out there.
I’m of the opinion there’s a good chance (and I’d take anything higher than, say, 1 in 10,000 as a “good” chance when we’re talking about potentially horrible outcomes) someone “bad” will let loose a not-so-well-aligned AGI before we have super-well-aligned (both inner and outer aligned) AGIs ready to autonomously defend against them.[2] Since my expertise is more well-suited for outer alignment than anything else in the alignment space, if I can make a tiny contribution towards speeding up outer alignment and making good AGIs more likely to win these initial battles, great.
[2] by ‘inner alignment’ i mean “how can we cause something-specific to be intelligently pursued”, and by ‘outer alignment’ “what should the specified thing be (and how can we construct that specification)”.
[3] requiring a specific ‘inner alignment’ structure isn’t per se a problem: some solutions are dual-solutions that are disentangle-ably both at once.
[5] in the sense of the opposite of ‘minor implementation details’.
[8] i write ‘simple’, though to be clear, ‘is alive or dead?’ is not a natural question for all conceivable AIs (e.g., see ‘a toy model/ontology’ here).
[1] Let’s say, for the sake of argument, that there is a “right” answer.
[2] It’ll have to be autonomous at least over most decisions, because humans won’t be able to keep up in real time with AGIs fighting it out.