Please spend two minutes filling in the polls below!
Planning where we focus at CaML requires forming views on many controversial questions, particularly with regard to alignment. In many cases, people we’ve talked to have very different intuitions about where the alignment community stands on these issues. These polls will help us get a sense of where the main areas of (dis)agreement lie.
Please feel free to tell us if you think the questions are ambiguous or embed false assumptions.
EDIT: Please answer based on your own best guess (and confidence) on these questions.
Thanks for surveying this! <3
I feel like people use “AI alignment” very differently. When I talk to the types who are interested in decision theory and agent foundations, they usually have something really sophisticated in mind: AIs that somehow try to act in such a way as to actually produce evidence that what they want to maximize will be maximized (no known solution, because I’m not happy with any of the implementations of UDT that I’ve seen). Other people usually just mean something like “The AI tries to act sort of like a well-intentioned person would.” The first seems good but very, very hard; the second seems outright dangerous, depending on details such as the particular idealizations that are applied.
Hence a question like “AI alignment to humans will in practice avoid moral catastrophes …” is a strong no for me, because such alignment might not only fail to prevent those catastrophes but actually produce them in the first place.
Idealization to eliminate scope insensitivity bias and idealization to eliminate speciesist and substratist biases are two different kinds of idealization. My answer changes radically depending on whether they can be disentangled.
Regarding tractability of digital minds work – I’m unsure whether I should count my worries about backfire risks as something that reduces scope or something that reduces tractability.
Regarding the reflective equilibrium, it’s critical to me whether we artificially study the TAI in isolation, which won’t happen in practice, or whether we embed it with other, different agents. The first is probably meant; the second is more pragmatic.
Control strikes me as safer, easier, and less reliable – a stopgap that can buy us a few years. I like that a lot more than an incomplete alignment solution that can backfire.
Suffering risks – vastly more likely in the multipolar world we’re steering towards – strike me as vastly worse than just competing away > 90% of net value, so my max. agree vote feels like an understatement. On the other hand, “will” is a higher probability than what I assign to s-risk (“might”).
Thanks Dawn, taking these in turn:
1: “Robust alignment” is a deliberately vague term; it’s meant to incorporate your views about how hard alignment is (e.g. UDT vs. well-intentioned).
4: It’s a hard question. Our perspective is that the backfire -> cluelessness -> don’t act chain can be thought of as low tractability.
5: By “stable under reflection” we meant the AI reflecting on its own values (while interacting with the world), where agreement means it wouldn’t change its values much (stylistically: an AI that shares 70% of our values in 2030 has those same values in 3030). But you’re right that how AIs interact (beyond competition, handled in the last question) is important.
7: S-risks do break the scale and we couldn’t find a good simple way to deal with that (though we’ll do other polls more directly on that later). The intent of “will” was to match 100% expected probability to 100% agree on the scale.
Thanks! Then I don’t think I need to update my answers. I’m looking forward to your next batch of questions!
I’d say this is the wrong question. Like, I do not expect that any current alignment approach is going to work. If we do ever figure out what works, it will not look like “pretraining” or “post-training”, it will be something completely different.
Although I guess you could call that “pretraining”?
Thanks Michael, we avoided mentioning post-training to imply that “new paradigm needed” would also count on the “disagree” side of the spectrum. In other words, “disagree” on this question would mean either “post-training is sufficient” or “new paradigms are needed/sufficient”.
Alignment to what? We don’t have a standard model of cognition. We’re essentially like alchemists before the periodic table and seem to be about as aware of the lack of a standard model as they were of the table. Lots of math, guesses, mystifications, surprises, accidents, and impressive results from “recipes” bound to less-than-impressive explanations.
A standard model not only provides a set of stable terms and relations to serve current explanation, it provides the framework for and optimization of how we go about forming and selecting lines of research. It becomes the basis of ongoing inquiries.
This isn’t an exotic expectation. Any mature science has a standard model, albeit (and fortunately) an evolving one. Almost any time I point this out to an ML scientist or engineer, it’s deer in the headlights.
If we’re going to engineer something that approximates intelligence, and we have no common standard model of intelligence... do I need to say more?