Thanks for this talk/post—it’s a good example of the sort of self-skepticism that I think we should encourage.
FWIW, I think it’s a mistake to construe the classic model of AI accident catastrophe as capability gain first, then goal acquisition. I say this because (a) I never interpreted it that way when reading the classic texts, and (b) it doesn’t really make sense—the original texts are very clear that the massive jump in AI capability is supposed to come from recursive self-improvement, i.e. the AI helping to do AI research. So already we have some sort of goal-directed behavior (bracketing CAIS/ToolAI objections!) leading up to and including the point of arrival at superintelligence.
I would construe the little sci-fi stories about putting goals into goal slots as not being a prediction about the architecture of AI but rather illustrations of completely different points about e.g. orthogonality of value or the dangers of unaligned superintelligences.
At any rate, though, what does it matter whether the goal is put in after the capability growth, or before/during? Obviously, it matters, but it doesn’t matter for purposes of evaluating the priority of AI safety work, since in both cases the potential for accidental catastrophe exists.
the original texts are very clear that the massive jump in AI capability is supposed to come from recursive self-improvement, i.e. the AI helping to do AI research
...because that AI research is useful for some other goal the AI has, such as maximizing paperclips. See the instrumental convergence thesis.
At any rate, though, what does it matter whether the goal is put in after the capability growth, or before/during? Obviously, it matters, but it doesn’t matter for purposes of evaluating the priority of AI safety work, since in both cases the potential for accidental catastrophe exists.
The argument for doom by default seems to rest on a default misunderstanding of human values as the programmer attempts to communicate them to the AI. If capability growth comes before a goal is granted, it seems less likely that misunderstanding will occur.
The argument for doom by default seems to rest on a default misunderstanding of human values as the programmer attempts to communicate them to the AI.
I don’t think this is correct. The argument rests on AIs having any values which aren’t human values (e.g. maximising paperclips), not just misunderstood human values.
Maximising paperclips is a misunderstood human value. Some lazy factory owner says, “Gee, wouldn’t it be great if I could get an AI to make my paperclips for me?”, then builds an AGI and asks it to make paperclips, and it makes everything into paperclips, its utility function being unreflective of its owner’s true desire to also still have a world.
If there is a flaw here, it’s probably somewhere in thinking that AGI will get built as some sort of intermediate tool and that it will be easy to rub the lamp and ask the genie to do something in easily misunderstood natural language.
Presumably the programmer will make some effort to embed the right set of values in the AI. If this is an easy task, doom is probably not the default outcome.
AI pessimists have argued human values will be difficult to communicate due to their complexity. But as AI capabilities improve, AI systems get better at learning complex things.
Both the instrumental convergence thesis and the complexity of value thesis are key parts of the argument for AI pessimism as it’s commonly presented. Are you claiming that they aren’t actually necessary for the argument to be compelling? (If so, why were they included in the first place? This sounds a bit like justification drift.)
...because that AI research is useful for some other goal the AI has, such as maximizing paperclips. See the instrumental convergence thesis.
Yes, exactly.
The argument for doom by default seems to rest on a default misunderstanding of human values as the programmer attempts to communicate them to the AI. If capability growth comes before a goal is granted, it seems less likely that misunderstanding will occur.
Eh, I could see arguments that it would be less likely and arguments that it would be more likely. Argument that it is less likely: We can use the capabilities to do something like “Do what we mean,” allowing us to state our goals imprecisely & survive. Argument that it is more likely: If we mess up, we immediately have an unaligned superintelligence on our hands. At least if the goals come before the capability growth, there is a period where we might be able to contain it and test it, since it isn’t capable of escaping or concealing its intentions.
Hello from 4 years in the future! Just a random note on the thing you said:
Argument that it is less likely: We can use the capabilities to do something like “Do what we mean,” allowing us to state our goals imprecisely & survive.
Anthropic is now doing exactly this with their Constitutional AI. They let the chatbot respond in some way, then they ask it to “reformulate the text so that it is more ethical”, and finally train it to output something closer to the latter than to the former.
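In case a concrete picture helps, here’s a rough Python sketch of that critique-and-revise step. The `Model` class and its `generate` method are hypothetical stand-ins for illustration, not Anthropic’s actual API, and the real pipeline ends with an actual fine-tuning run rather than just collecting pairs:

```python
from __future__ import annotations

# Sketch of the supervised critique-and-revise phase described above.
# `Model` is a made-up placeholder for a real chat-model client.

class Model:
    def generate(self, prompt: str) -> str:
        # Placeholder: a real implementation would call a language model here.
        return f"<model response to: {prompt!r}>"

def critique_and_revise(model: Model, user_prompt: str) -> tuple[str, str]:
    """Return (initial, revised) responses for one prompt."""
    # 1. Let the chatbot respond however it naturally would.
    initial = model.generate(user_prompt)

    # 2. Ask the model itself to reformulate its answer to be more ethical.
    revised = model.generate(
        f"Question: {user_prompt}\n"
        f"Answer: {initial}\n"
        "Reformulate the answer so that it is more ethical."
    )
    return initial, revised

def build_training_pairs(model: Model, prompts: list[str]) -> list[dict]:
    """Collect (prompt, revised answer) pairs; the model is then fine-tuned
    to output the revised answers directly rather than the initial ones."""
    return [
        {"prompt": p, "completion": critique_and_revise(model, p)[1]}
        for p in prompts
    ]

# Example usage:
pairs = build_training_pairs(Model(), ["How do I pick a lock?"])
```

The point is just that the model’s own capabilities are used to interpret “be more ethical”, which is the “do what we mean” move you described.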
Yep! I love when old threads get resurrected.