How does Rationalist Community Attention/Consensus compare? I’d like to mention a paper of mine published at the top AI theory conference which proves that when a certain parameter of a certain agent is set sufficiently high, the agent will not aim to kill everyone, while still achieving at least human-level intelligence. This follows from Corollary 14 and Corollary 6. I am quite sure most AI safety researchers would have confidently predicted no such theorems ever appearing in the academic literature. And yet there are no traces of any minds being blown. The associated Alignment Forum post only has 22 upvotes and one comment, and I bet you’ve never heard any of your EA friends discuss it. It hasn’t appeared, to my knowledge, in any AI safety syllabuses. People don’t seem to bother investigating or discussing whether their concerns with the proposal are surmountable. I’m reluctant to bring up this example since it has the air of a personal grievance, but I think the disinterest from the Rationality Community is erroneous enough that it calls for an autopsy. (To be clear, I’m not saying everyone should be hailing this as an answer to AI existential risk, only that it should definitely be of significant interest.)
I’m someone who has read your work (this paper and FGOIL, the latter of which I have included in a syllabus), and who would like to see more work in a similar vein, as well as more formalism in AI safety. I say this to establish my bona fides, the way you established your AI safety bona fides.
I don’t think this paper is mind-blowing, and I would call it representative of one of the ways in which tailoring theoretical work for the peer-review process can go wrong. In particular, you don’t show that “when a certain parameter of a certain agent is set sufficiently high, the agent will not aim to kill everyone”, you show something more like “when you can design and implement an agent that acts and updates its beliefs in a certain way and can restrict the initial beliefs to a set containing the desired ones and incorporate a human into the process who has access to the ground truth of the universe, then you can set a parameter high enough that the agent will not aim to kill everyone” [edit: Michael disputes this last point, see his comment below and my response], which is not at all the same thing. The standard academic failure mode is to make a number of assumptions for tractability that severely lower the relevance of the results (and the more pernicious failure mode is to hide those assumptions).
You’d be right if you said that most AI safety people did not read the paper and come to that conclusion themselves, and even if you said that most weren’t even aware of it. Very little of the community has the relevant background for it (and I would like to see a shift in that direction), especially the newcomers that are the targets of syllabi. All that said, I’m confident that you got enough qualified eyes on it that if you had shown what you said in your summary, it would have had an impact similar in scale to what you think is appropriate.
This comment is somewhat of a digression from the main post, but I am concerned that if someone took your comments about the paper at face value, they would come away with an overly negative perception of how the AI safety community engages with academic work.
The standard academic failure mode is to make a number of assumptions for tractability that severely lower the relevance of the results (and the more pernicious failure mode is to hide those assumptions).
Perhaps, but at least these assumptions are stated. Most work leans on similarly strong assumptions (for tractability, brevity, or lack of rigour meaning you don’t even realise you are doing it) but doesn’t state them.
I’m someone who has read your work (this paper and FGOIL, the latter of which I have included in a syllabus), and who would like to see more work in a similar vein, as well as more formalism in AI safety. I say this to establish my bona fides, the way you established your AI safety bona fides.
Thanks! I should have clarified it has received some interest from some people.
you don’t show that “when a certain parameter of a certain agent is set sufficiently high, the agent will not aim to kill everyone”, you show something more like “when you can design and implement an agent that acts and updates its beliefs in a certain way and can restrict the initial beliefs to a set containing the desired ones and incorporate a human into the process who has access to the ground truth of the universe, then you can set a parameter high enough that the agent will not aim to kill everyone”
“When you can design and implement an agent that acts and updates its beliefs in a certain way and can restrict the initial beliefs to a set containing the desired ones”. That is the “certain agent” I am talking about. “Restrict” is an odd word choice, since the set can be as large as you like as long as it contains the truth. “and incorporate a human into the process who has access to the ground truth of the universe.” This is incorrect; can I ask you to edit your comment? Absolutely nothing is assumed about the human mentor, certainly no access to the ground truth of the universe; it could be a two-year-old or a corpse! That would just make the Mentor-Level Performance Corollary less impressive.
I don’t deny that certain choices about the agent’s design make it intractable. This is why my main criticism was “People don’t seem to bother investigating or discussing whether their concerns with the proposal are surmountable.” Algorithm design for improved tractability is the bread and butter of computer science.
I’ll edit the comment to note that you dispute it, but I stand by the comment. The AI system trained is only as safe as the mentor, so the system is only safe if the mentor knows what is safe. By “restrict”, I meant for performance reasons, so that it’s feasible to train and deploy in new environments.
Again, I like your work and would like to see more similar work from you and others. I am just disputing the way you summarized it in this post, because I think that portrayal makes its lack of splash in the alignment community a much stronger point against the community’s epistemics than it deserves.
Thank you for the edit, and thank you again for your interest. I’m still not sure what you mean by a person “having access to the ground truth of the universe”. There’s just no sense I can think of in which this is a requirement for the mentor.
“The system is only safe if the mentor knows what is safe.” It’s true that if the mentor kills everyone, then the combined mentor-agent system would kill everyone, but surely that fact doesn’t weigh against this proposal at all. In any case, more importantly a) the agent will not aim to kill everyone regardless of whether the mentor would (Corollary 14), which I think refutes your comment. And b) for no theorem in the paper does the mentor need to know what is safe; for Theorem 11 to be interesting, he just needs to act safely (an important difference for a concept so tricky to articulate!). But I decided these details were beside the point for this post, which is why I only cited Corollary 14 in the OP, not Theorem 11.
Do you have a minute to react to this? Are you satisfied with my response?