Hey, I think this sort of work can be really valuable—thanks for doing it, and (Tristan) for reaching out about it the other day!
I wrote up a few pages of comments here (initially just for Tristan but he said he’d be fine with me posting it here). Some of them are about nitpicky typos that probably won’t be of interest to anyone but the authors, but I think some will be of general interest.
Despite its length, even this batch of comments just consists of what stood out on a quick skim; there are whole sections (especially of the appendix) that I’ve barely read. But in short, for whatever it’s worth:
I think that a model roughly in this direction is largely on the right track, if you think you can allocate the entire AI safety budget (and think that the behavior of other relevant actors, like AI developers, is independent of what you do). If so, you can frame the problem as an optimization problem, as you have done, and build in lots of complications. If not, though—i.e. if you’re trying to allocate only some part of the AI safety budget, in light of what other actors are doing (and how they might respond to your own decisions)—you have to frame the problem as a game, at which point it quickly loses tractability as you build in complications. (My own approach has been to think about the problem of allocating spending over time as a simple game, and this is part of what accounts for the different conclusions, as noted at the top of the doc.) I don’t know if the “only one big actor” simplification holds closely enough in the AI safety case for the “optimization” approach to be a better guide, but it may well be.
That said, I also think that this model currently has mistakes large enough to render the quantitative conclusions unreliable. For example, the value of spending after vs. before the “fire alarm” seems to depend erroneously on the choice of units of money. (This is the second bit of red-highlighted text in the linked Google doc.) So I’d encourage someone interested in quantifying the optimal spending schedule on AI safety to start with this model, but then comb over the details very carefully.
Thanks again Phil for taking the time to read this through and for the in-depth feedback.
I hope to take some time to create a follow-up post, working in your suggestions and corrections as external updates (e.g. adjusting the parameters for lower total AI risk funding and shorter Metaculus timelines).
I don’t know if the “only one big actor” simplification holds closely enough in the AI safety case for the “optimization” approach to be a better guide, but it may well be.
This is a fair point.
The initial motivation for the project was AI s-risk funding, for which there is pretty much one large funder (and not much work on AI s-risk reduction is done by people and organizations outside the effective altruism community), though this post is entirely about AI existential risk, which is less well modelled as a single actor.
My intuition is that the “one big actor” simplification does work sufficiently well for the AI risk community, given the shared goal (avoid an AI existential catastrophe) and my guess that a lot of the AI risk work done by the community doesn’t change the behaviour of AI labs much (e.g. it could be that labs choose to put more effort into capabilities over safety because of work done by the AI risk community, but I’m pretty sure this isn’t happening).
For example, the value of spending after vs. before the “fire alarm” seems to depend erroneously on the choice of units of money. (This is the second bit of red-highlighted text in the linked Google doc.) So I’d encourage someone interested in quantifying the optimal spending schedule on AI safety to start with this model, but then comb over the details very carefully.
To comment on this particular error (though not to suggest that the other errors Phil points to are unproblematic—I’ve yet to properly go through them): for what it’s worth, the main results of the post suppose zero post-fire-alarm spending[1], and (fortunately) since our results use units of millions of dollars and take the initial capital to be on the order of $1000m, I don’t think we face this problem of a smaller η having the reverse of the desired effect.
In a future version I expect I’ll just take the post-fire-alarm returns to spending to use the same returns exponent η as before the fire alarm but with some multiplier—i.e. x^η returns to spending before the fire alarm and k·x^η afterwards.
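To make the unit-dependence concrete, here is a minimal sketch (illustrative only—the function names, exponents, and spending figures are made up for the example and are not taken from the post’s model) of why giving pre- and post-fire-alarm spending different exponents makes their relative value depend on the choice of money units, while a shared exponent η with a multiplier k does not:

```python
# Illustrative sketch, not the post's actual code or parameter values.

def returns_two_exponents(x, eta_pre, eta_post):
    """Returns to spending x with different exponents before/after the fire alarm."""
    return x ** eta_pre, x ** eta_post

def returns_common_exponent(x, eta, k):
    """Same exponent eta before the fire alarm, multiplier k afterwards."""
    return x ** eta, k * x ** eta

spend_in_millions = 3.0     # spending of $3m, expressed in units of $m
spend_in_dollars = 3.0e6    # the same spending, expressed in units of $

# With two exponents, the post/pre ratio scales like x**(eta_post - eta_pre),
# so it changes when we merely change the units of money.
for x in (spend_in_millions, spend_in_dollars):
    pre, post = returns_two_exponents(x, eta_pre=0.5, eta_post=0.3)
    print(f"x={x:>12,.0f}: post/pre ratio with two exponents = {post / pre:.4f}")

# With a common eta and multiplier k, the post/pre ratio is k in any units.
for x in (spend_in_millions, spend_in_dollars):
    pre, post = returns_common_exponent(x, eta=0.5, k=2.0)
    print(f"x={x:>12,.0f}: post/pre ratio with common eta    = {post / pre:.4f}")
```

Under this toy parameterisation the two-exponent ratio drops from about 0.80 (in $m) to about 0.05 (in $), whereas the shared-η version gives k = 2 in both cases.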
[1] Though if one thinks there will be many good opportunities to spend after a fire alarm, our main no-fire-alarm results would likely be an overestimate.