Thanks for surveying this! <3
I feel like people use “AI alignment” very differently. When I talk to the types who are interested in decision theory and agent foundations, they usually have something really sophisticated in mind: AIs that somehow (there is no known solution, as I’m not happy with any of the implementations of UDT that I’ve seen) try to act in such a way as to actually produce evidence that what they want to maximize will be maximized. Other people usually just mean something like “The AI tries to act sort of like a well-intentioned person would.” The first seems good but very, very hard; the second seems outright dangerous, depending on details such as the particular idealizations that are applied.
Hence a question like “AI alignment to humans will in practice avoid moral catastrophes …” is a strong no for me, because such alignment might not only fail to prevent those catastrophes but actually produce them in the first place.
Idealization to eliminate the scope insensitivity bias and idealization to eliminate the speciesist and substratist biases are two different kinds of idealization. My answer changes radically depending on whether they can be disentangled.
Regarding tractability of digital minds work – I’m unsure whether I should count my worries about backfire risks as something that reduces scope or something that reduces tractability.
Regarding the reflective equilibrium, it’s critical to me whether we artificially study the TAI in isolation, which won’t happen in practice, or whether we embed it with other, different agents. The first is probably meant; the second is more pragmatic.
Control strikes me as safer, easier, and less reliable – a stopgap that can buy us a few years. I like that a lot more than an incomplete alignment solution that can backfire.
Suffering risks – vastly more likely in the multipolar world we’re steering towards – strike me as vastly worse than just competing away > 90% of net value, so my max. agree vote feels like an understatement. On the other hand, “will” is a higher probability than what I assign to s-risk (“might”).
Thanks Dawn, taking these in turn:
1: “Robust alignment” is a deliberately vague term; it’s meant to incorporate your views about how hard alignment is (e.g. UDT vs. well-intentioned).
4: It’s a hard question; our perspective is that the backfire -> cluelessness -> don’t act chain can be thought of as low tractability.
5: By “stable under reflection” we meant the AI reflecting on its own values (while interacting with the world), where agreement means it wouldn’t change its values much (as a stylized example: an AI that shares 70% of our values in 2030 has those same values in 3030). But you’re right that how AIs interact (beyond competition, handled in the last question) is important.
7: S-risks do break the scale, and we couldn’t find a good simple way to deal with that (though we’ll do other polls more directly on that later). The intent of “will” was to match 100% expected probability to 100% agree on the scale.
Thanks! Then I don’t think I need to update my answers. I’m looking forward to your next batch of questions!