Thanks Jonas! I really appreciate your constructive engagement.
I’m not very sympathetic to pure time preference – in fact, most philosophers hate pure time preference and are leading the academic charge against it. I’m also not very sympathetic to person-affecting views, though many philosophers whom I respect are more sympathetic. I don’t rely on either of these views in my arguments, because I think they are false.
In general, my impression on the topic of advancing progress is that longtermists have not yet identified plausible interventions which could do this well enough to be worth funding. There has certainly been a lot written about the value of advancing progress, but not many projects have been carried out to actually do this. That tends to suggest to me that I should hold off on commenting about the value of advancing progress until we have more concrete proposals on the table. It also tends to suggest that advancing progress may be hard.
I think that covers most of the content of Toby’s posts (advancing progress and temporal discounting). Perhaps one more thing that would be important to stress is that I would really like to see Oxford philosophers following the GPI model of putting out smaller numbers of rigorous academic papers rather than larger numbers of forum posts and online reports. It’s not very common for philosophers to engage with one another on internet forums – such writing is usually regarded as a form of public outreach. My first reaction to EA Forum posts by Toby and others would be that I’d like to see them written up as research papers. When ideas are written up online before they are published in academic venues, we tend to be skeptical that the research has actually been done or that it would hold up to scrutiny if it were.
On power-seeking, one of the most important features of academic papers is that they aim to tackle a single topic in depth, rather than a broad range of topics in less depth. So for example, my power-seeking paper does not engage with evolutionary arguments by Hendrycks and others because they are different arguments and would need to be addressed in a different paper.
To be honest, my selection of arguments to engage with is largely driven by the ability of those making the arguments to place them in top journals and conferences. Academics tend to be skeptical of arguments that have not cleared this bar, and there is little career value in addressing them. The reason why I addressed the singularity hypothesis and power-seeking is that both have had high-profile backing by quality authors (Bostrom, Chalmers) or in high-profile conferences (NeurIPS). I’d be happy to engage with evolutionary arguments when they clear this bar, but they’re a good ways off yet and I’m a bit skeptical that they will clear it.
I don’t want to legislate a particular definition of power, because I don’t think there’s a good way to determine what that should be. My concern is with the argument from power-seeking. This argument needs to use a notion of power such that AI power-seeking would lead to existentially catastrophic human disempowerment. We can define power in ways that make power-seeking easier to establish, but this mostly passes the buck, since we will then need to argue why power-seeking in this sense would lead to existentially catastrophic human disempowerment.
One example of this kind of buck-passing is in the Turner et al. papers. For Turner and colleagues, power is (roughly) the ability to achieve valuable outcomes or to promote an agent’s goals. It’s not so hard to show that agents will tend to be power-seeking in this sense, but this also doesn’t imply that the results will be existentially catastrophic without further argument, and that argument will probably take us substantially beyond anything like the traditional argument from power-seeking.
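For concreteness, and paraphrasing in my own notation rather than quoting the paper (see "Optimal Policies Tend to Seek Power" for the exact statement and normalization), the Turner et al. notion of power is roughly the average optimal value attainable from a state across a distribution \(\mathcal{D}\) of reward functions:
\[
\mathrm{POWER}_{\mathcal{D}}(s) \;\approx\; \mathbb{E}_{R \sim \mathcal{D}}\big[V^{*}_{R}(s)\big].
\]
It is comparatively easy to show that optimal policies tend to prefer states that score highly on this quantity, for example states that keep more options open, but that result by itself tells us nothing about whether the outcome is existentially catastrophic.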
Likewise, we might, as you suggest, think about power-seeking as free-energy minimization and adopt a view on which most or all systems seek to minimize free energy. But precisely because this view takes free energy minimization to be common, it won’t get us very far towards arguing that the results will be existentially catastrophic, nor even that they will involve human disempowerment (in the sense of failure to minimize free energy).
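For readers unfamiliar with the term: in the active inference literature, variational free energy is, roughly,
\[
F \;=\; \mathbb{E}_{q(s)}\big[\ln q(s) - \ln p(o, s)\big] \;=\; D_{\mathrm{KL}}\big(q(s)\,\|\,p(s \mid o)\big) \;-\; \ln p(o),
\]
an upper bound on surprise (negative log evidence) that essentially any self-maintaining system can be described as minimizing. This is my gloss rather than anything specific to your suggestion, but it illustrates the worry: a description that applies to nearly everything does not, on its own, single out the systems we should fear.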
I agree with you that it is unfair to conclude much from the failure of MIRI arguments. The discussion of the MIRI paper is cut from the version of the power-seeking paper that I submitted to a journal. Academic readers are likely to have had high credence that the MIRI argument was bad before reading it, so they will not update much on the failure of this argument. I included this argument in the extended version of the paper because I know that some effective altruists were moved by it, and also because I didn’t know of any other formal work that broke substantially from the Turner et al. framework. I think that the MIRI argument is not anywhere near as good as the Turner et al. argument. While I have many disagreements with Turner et al., I want to stress that their results are mathematically sophisticated and academically rigorous. The MIRI paper is neither.
I hope that helps!
Thank you for that substantive response, I really appreciate it! It was also very nice that you mentioned the Turner et al. definitions; I wasn’t expecting that.
(Maybe write a post on that? Another comment mentions uptake from major players in the EA ecosystem, and if you show that you understand their arguments, they might be more sympathetic to yours. Just a quick thought, but it might be worth engaging there a bit more.)
I just wanted to clarify some of the points I was trying to make yesterday, since I realise they didn’t all come across as I intended.
I completely agree with you on the advancing-progress point. At a general level I’m quite sceptical of it myself: I don’t believe we can counterfactually change the “rowing” speed very much in the grand scheme of things. If I remember correctly, that is also the conclusion of Toby’s posts. His point was rather that existential risk reduction is worth a lot compared to any progress we might be able to make: “steering” away from the bad stuff is worth more. (That’s the implicit claim of the modelling, even though he’s as epistemically humble as you philosophers always are, which is commendable!)
Now for the power-seeking stuff. I appreciate your careful reasoning here, and I see what you mean: there’s no threat model in that claim by itself. If the classical construal of power-seeking turns out to be equivalent to something like minimizing free energy, then the claim is close to tautological and doesn’t do any work for existential risk.
I think I can agree that we’re not yet clear enough about the existential-risk angle to have a well-defined goal for what to do. I do think there’s an argument there, but we would have to be quite precise about how we define it for it to make foundational sense. One question is whether, in the process of working on the problem, we would gain more clarity about what it fundamentally is, a bit like a startup figuring out what it’s doing along the way. From an unknown-unknowns perspective, and from the perspective of shifting institutional practices, it might still be worth the resources. TAI is such a big thing, and it will only happen once, so spending those resources on relatively shaky foundations might still make sense.
I’m not sure that this is the case, however; Wei Dai, for example, has an entire agenda around “metaphilosophy”, where the claim is that we’re too philosophically confused to make sense of alignment. In general, I agree that securing the philosophical and mathematical foundations is very important for coordinating the field, and it’s something I’ve been thinking about for a while.
Personally, I’m trying to import ideas from existing fields that deal with generally intelligent agents in biology and cognitive science, such as Active Inference and Computational Biology, to see how TAI will affect society. If we see the special sciences as offshoots of philosophy, then the places with the most rigorous thinking on these foundations are the fields that have dealt with them for a long time. I’ve found a lot of interesting models of misalignment in these areas that I think can be carried over into the AI safety frame.
I really appreciate your deconstructive approach to the intellectual foundations of the field. I do believe there are alternatives to the classic risk story, but you have to break down the flaws in the existing arguments to some extent in order to advocate for new ones.
Finally, I think the threat models I have in mind come from arguments similar to those in Paul Christiano’s What Failure Looks Like and the “going out with a whimper” idea. This is also explored in Yuval Noah Harari’s books Nexus and Homo Deus. This threat model is closer to authoritarian capture than to something like a runaway intelligence explosion.
I’m looking forward to more work in this area from you!