At best, these theory-first efforts did very little to improve our understanding of how to align powerful AI. And they may have been net negative, insofar as they propagated a variety of actively misleading ways of thinking both among alignment researchers and the broader public. Some examples include the now-debunked analogy from evolution, the false distinction between “inner” and “outer” alignment, and the idea that AIs will be rigid utility maximizing consequentialists (here, here, and here).
Random aside, but I think this paragraph is unjustified both in its core argument (that the referenced theory-first efforts propagated actively misleading ways of thinking about alignment) and in its evidence: none of the citations provide the claimed support.
The first post (re: the evolutionary analogy as evidence for a sharp left turn) received substantial pushback in the comments; that pushback seems more correct to me than not, and in any case the post seems to misunderstand the position it’s arguing against.
The second post presents an interesting case for a set of claims that are different from “there is no distinction between inner and outer alignment”; I do not consider it to be a full refutation of that conceptual distinction. (See also Steven Byrnes’ comment.)
The third post is at best playing games with the definitions of words (or misunderstanding the thing it’s arguing against), at worst is just straightforwardly wrong.
I have less context on the fourth post, but from a quick skim of both the post and the comments, I think its main relevance here is as a demonstration of how important it is to be careful and precise with one’s claims. (The post is not arguing about whether AIs will be “rigid utility maximizing consequentialists”; it makes a variety of arguments about whether coherence theorems necessarily require that whatever ASI we might build will behave in a goal-directed way. Relatedly, Rohin’s comment a year after writing that post indicated that he does think we’re likely to develop goal-directed agents; he just doesn’t think that’s entailed by arguments from coherence theorems, which may or may not have been made by e.g. Eliezer in other essays.)
My guess is that you did not include the fifth post as a smoke test to see if anyone was checking your citations, but I am having trouble coming up with a charitable explanation for its inclusion in support of your argument.
I’m not really sure what my takeaway is here, except that I didn’t go scouring the essay for mistakes; the citation of Quintin’s post was just the first thing that jumped out at me, since it wasn’t all that long ago. I think the claims made in the paragraph are basically unsupported by the evidence, and the evidence itself is substantially mischaracterized. Based on other comments, it looks like the same is true of a number of other substantial claims and arguments in the post:
- that Bostrom’s core argument has aged poorly
- that CIRL is widely considered irrelevant
- whether proposed pauses are intended to be temporary[1]
Though I’m sort of confused about what this back-and-forth is talking about, since it’s referencing behind-the-scenes stuff that I’m not privy to.