While not providing anything like a solution to the central issue here, I want to note that it looks likely to be the middle classes that get hollowed out first—human labour to do all kinds of physical tasks is likely to be valued for longer than various kinds of desk-based tasks, because scaling up and deploying robotics to replace them would take significant time, whereas scaling up the automation of desk-based tasks can be relatively quick.
Thanks for exploring this, I found it quite interesting.
I’m worried that casual readers might come away with the impression “these dynamics of compensation for safety work being a big deal obviously apply to AI risk”. But I think this is unclear, because we may not have the key property (that you call assumption (b)).
Intuitively I’d describe this property as “meaningful restraint”, i.e. people are holding back a lot from what they might achieve if they weren’t worried about safety. I don’t think this is happening in the world at the moment. It seems plausible that it will never happen—i.e. the world will be approximately full steam ahead until it gets death or glory. In this case there is no compensation effect, and safety work is purely good in the straightforward way.
To spell out the scenario in which safety work now could be bad because of risk compensation: perhaps in the future everyone is meaningfully restrained, but if more work on how to build things safely has been done ahead of time, they're less worried and so less restrained. I think this is a realistic possibility.

But I think that this world is made much safer if there is less variance between different actors' models of how much risk there is, so that the one who presses ahead isn't the actor who is an outlier in not expecting risk. Relatedly, I think we're much more likely to reach such a scenario if many people have got on a similar page about the levels of risk. But I think that a lot of "technical safety" work at the moment (and certainly not just "evals") is importantly valuable for helping people to build common pictures of the character of the risk, and of how high risk levels are under various degrees of safety measures. So a lot of what people think of as safety work actually looks good even in exactly the scenario where we might get >100% risk compensation.
All of this isn't to say "risk compensation shouldn't be a concern", but more like "I think we're going to have to model this at a finer granularity to get a sense of when it might or might not be a concern for the particular case of technical AI safety work".
A small point of confusion: taking U(C) = C (+ a constant) by appropriate parametrization of C is an interesting move. I’m not totally sure what to think of it; I can see that it helps here, but it makes it seem quite hard work to develop good intuitions about the shape of P. But the one clear intuition I have about the shape of P is that there should be some C>0 where P is 0, regardless of S, because there are clearly some useful applications of AI which pose no threat of existential catastrophe. But your baseline functional form for P excludes this possibility. I’m not sure how much this matters, because as you say the conclusions extend to a much broader class of possible functions (not all of which exclude this kind of shape), but the tension makes me want to check I’m not missing something?
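To make the shape I have in mind concrete (these functional forms are purely illustrative on my part, not the ones from the post), compare a form that is strictly positive whenever C > 0 with a thresholded variant:

$$P_{\text{baseline}}(C,S)=\min\!\left(1,\ \frac{C^{\alpha}}{S^{\beta}}\right), \qquad P_{\text{thresholded}}(C,S)=\min\!\left(1,\ \frac{\max(0,\,C-C_0)^{\alpha}}{S^{\beta}}\right).$$

The second gives P = 0 for all C ≤ C_0 regardless of S, which is the shape my one clear intuition points to.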
Maybe? It seems a bit extreme for that; I think 5⁄6 of the “disagree” votes came in over a period of an hour or two mid-evening UK time. But it could certainly just be coincidence, or a group of people happening to discuss it and all disagree, or something.
OK actually there’s been a funny voting pattern on my top-level comment here, where I mostly got a bunch of upvotes and agree-votes, and then a whole lot of downvotes and disagree-votes in one cluster, and then mostly upvotes and agree-votes since then. Given the context, I feel like I should be more open than usual to a “shenanigans” hypothesis, which feels like it would be modest supporting evidence for the original conclusion.
Anyone with genuine disagreement—sorry if I’m rounding you into that group unfairly, and I’m still interested in hearing about it.
(If anyone disagreeing wants to get into explaining why, I’m interested. Honestly it would be more comforting to be wrong about this.)
When I first read this article I assumed it was written in good faith (and found it quite helpful). However, at this point I think it’s correct to assume that “Mark Fuentes” (an admitted pseudonym which has only been used to write about Torres) is misrepresenting their identity, and in particular likely has some substantial history of involvement with the EA community, and perhaps history of beef with Torres, rather than having come to this topic as a disinterested party.
This view is based on:
Torres’s claims about patterns they’ve seen in criticism (part 3 of this; evidence I take as suggestive but by no means conclusive)
Mark refusing to consider any steps to verify their identity, and instead inviting people to disregard the content in the section called “my story”
Some impressions I can’t fully unpack about the tone and focus of Mark’s comments on this post (and their private message to me) seeming better explained by them not having been a disinterested party than by them having been one
A view that we’re not supposed to give fully anonymous accounts the benefit of the doubt:
… in order not to be open to abuse by people claiming whatever identity most supports their points;
… because they’re not putting their reputation on the line;
… because the costs are smaller if they are incorrectly smeared (it doesn’t attach to any real person’s reputation).
With that assumption, I feel kind of upset. I'm not a fan of Torres, but I think grossly misrepresenting authorship is unacceptable, and it's all the more important to call it out when it's coming from someone I might otherwise find myself on the same side of an argument as. And while I expect that much of the content of the post is still valid, it's harder to take at face value now that I suspect more strongly that the examples have been adversarially selected.
Hi Mark,
I wonder if you'd be willing to do something along the lines of privately verifying that your identity is roughly as described in your post? I think this could be pretty straightforward, and might help a bunch in making things clear and low-drama. (At present you're stating that the claims about your identity are a fabrication, but there's no way for external parties to verify this.)
I think from something like a game-theoretic perspective (i.e. to avoid creating incentives for certain types of escalation if someone is willing to engage in bad faith), absent some verification it will be reasonable for observers to assume that Torres is correct that the anonymous account “Mark Fuentes” is misrepresenting itself as a disinterested party. (Which would be relevant information for readers in interpreting the post, even if much of the content remained valid.)
Thanks for this exploration.
I do think that there are some real advantages to using the intentional stance for LLMs, and I think these will get stronger in the future when applied to agents built out of LLMs. But I don’t think you’ve contrasted this with the strongest version of the design stance. My feeling is that this is not taking humans-as-designers (which I agree is apt for software but not for ML), but taking the training-process-as-designer. I think this is more obvious if you think of an image classifier—it’s still ML, so it’s not “designed” in a traditional sense, but the intentional stance seems not so helpful compared with thinking of it as having been designed-by-the-training-process, to sort images into categories. This is analogous to understanding evolutionary adaptations of animals or plants as having been designed-by-evolution.
Taking this design stance on LLMs can lead you to “simulator theory”, which I think has been fairly helpful in giving some insights about what’s going on: https://www.lesswrong.com/tag/simulator-theory
I want to say thank you for holding the pole of these perspectives and keeping them in the dialogue. I think that they are important and it’s underappreciated in EA circles how plausible they are.
(I definitely don’t agree with everything you have here, but typically my view is somewhere between what you’ve expressed and what is commonly expressed in x-risk focused spaces. Often also I’m drawn to say “yeah, but …”—e.g. I agree that a treacherous turn is not so likely at global scale, but I don’t think it’s completely out of the question, and given that I think it’s worth serious attention safeguarding against.)
I might think of FHI as having borrowed prestige from Oxford. I think it benefited significantly from that prestige. But in the longer run it gets paid back (with interest!).
That metaphor doesn’t really work, because it’s not that FHI loses prestige when it pays it back—but I think the basic dynamic of it being a trade of prestige at different points in time is roughly accurate.
I’m worried I’m misunderstanding what you mean by “value density”. Could you perhaps spell this out with a stylized example, e.g. comparing two different interventions protecting against different sizes of catastrophe?
I think human extinction over 1 year is extremely unlikely. I estimated 5.93*10^-12 for nuclear wars, 2.20*10^-14 for asteroids and comets, 3.38*10^-14 for supervolcanoes, a prior of 6.36*10^-14 for wars, and a prior of 4.35*10^-15 for terrorist attacks.
Without having dug into them closely, these numbers don't seem crazy to me for the current state of the world. I think that the risk of human extinction over 1 year is almost all driven by some powerful new technology (with residues for the wilder astrophysical disasters, and the rise of some powerful ideology which somehow leads there). But this is an important class! In general, dragon kings operate via a mechanism that is different from the one behind the tamer parts of the distribution, and "new technology" could totally facilitate that.
Do you have a sense of the extent to which the dragon king theory applies in the context of deaths in catastrophes?
Unfortunately, for the relevant part of the curve (catastrophes large enough to wipe out large fractions of the population) we have no data, so we'll be relying on theory. My understanding (based significantly just on the "mechanisms" section of that Wikipedia page) is that dragon kings tend to arise in cases where there's a qualitatively different mechanism which causes the very large events but doesn't show up in the distribution of smaller events. In some cases we might not have such a mechanism, and in others we might. It certainly seems plausible to me when considering catastrophes (and this is enough to drive significant concern, because if we can't rule it out it's prudent to be concerned, and risk having wasted some resources if we turn out to be in a world where the total risk is extremely small), via the kind of mechanisms I allude to in the first half of this comment.
Sorry, I understood that you primarily weren’t trying to model effects on extinction risk. But I understood you to be suggesting that this methodology might be appropriate for what we were doing in that paper—which was primarily modelling effects on extinction risk.
Sorry, this isn’t speaking to my central question. I’ll try asking via an example:
Suppose we think that there’s a 1% risk of a particular catastrophe C in a given time period T which kills 90% of people
We can today make an intervention X, which costs $Y, and means that if C occurs it will only kill 89% of people
We pay the cost $Y in all worlds, including the 99% in which C never occurs
When calculating the cost to save a life for X, do you:
A) condition on C, so you save 1% of people at the cost of $Y; or
B) don’t condition on C, so you save an expected 0.01% of people at a cost of $Y?
I’d naively have expected you to do B) (from the natural language descriptions), but when I look at your calculations it seems like you’ve done A). Is that right?
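To make the difference concrete, here is a quick sketch with made-up numbers (the population and cost figures are placeholders on my part, not taken from your post):

```python
# Illustrative sketch of options A) and B); all numbers are placeholders.
population = 8e9        # people alive at the start of period T
p_catastrophe = 0.01    # 1% chance that catastrophe C occurs during T
cost = 1e9              # $Y, paid in all worlds whether or not C occurs

# Intervention X cuts deaths in C from 90% to 89% of the population.
lives_saved_given_C = 0.01 * population            # 80 million lives, conditional on C

# A) condition on C occurring
cost_per_life_A = cost / lives_saved_given_C       # $12.50 per life saved

# B) don't condition on C
expected_lives_saved = p_catastrophe * lives_saved_given_C   # 800,000 lives in expectation
cost_per_life_B = cost / expected_lives_saved      # $1,250 per life saved

print(cost_per_life_A, cost_per_life_B)
```

The two conventions differ by a factor of 1/p(C), here 100x, which is why I want to check which one the calculations are using.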
I think if you're primarily trying to model effects on extinction risk, then doing everything via "proportional increase in population", and nowhere directly analysing extinction risk, seems like a weirdly indirect way to do it, and leaves me with a bunch of questions about whether that's really the best way to do it.
Re.
Cotton-Barratt 2020 says “it’s usually best to invest significantly into strengthening all three defence layers”:
"This is because the same relative change of each probability will have the same effect on the extinction probability". I agree with this, but I wonder whether tail risk is the relevant metric. I think it is better to look at the expected value density of the cost-effectiveness of saving a life, accounting for indirect long-term effects as I did. I predict this expected value density to be higher for the first layers, which correspond to lower severity but are more likely to be called upon. So, to equalise the marginal cost-effectiveness of additional investments across all layers, it may well be better to invest more in prevention than in response, and more in response than in resilience.
That paper was explicitly considering strategies for reducing the risk of human extinction. I agree that relative to the balance you get from that, society should skew towards prioritizing response and especially prevention, since these are also important for many of society’s values that aren’t just about reducing extinction risk.
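(For reference, the structure behind the quoted sentence, in my own compressed restatement rather than a quote from the paper, is roughly multiplicative:

$$P(\text{extinction}) \approx P(\text{prevention fails}) \times P(\text{response fails} \mid \text{prevention fails}) \times P(\text{resilience fails} \mid \text{both fail}),$$

so a given relative reduction in any one factor reduces the product by the same relative amount.)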
I'm worried that modelling the tail risk here as a power law is doing a lot of work, since it's an assumption which makes the risk of very large events quite small (especially since you're taking a power law in the ratio: aside from the threshold of requiring a certain number of humans for a viable population, the structure of the assumption essentially implies that extinction is impossible).
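To spell out that parenthetical, with an illustrative parametrisation on my part that may not exactly match yours: if the power law is placed on a ratio such as

$$R=\frac{N_{\text{before}}}{N_{\text{after}}}, \qquad \Pr(R>r)\propto r^{-\alpha},$$

then extinction corresponds to the limit R → ∞, whose probability goes to zero under this functional form, so the near-impossibility of extinction is built into the modelling choice rather than coming from the data.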
But we know from (the fancifully named) dragon king theory that the very largest events are often substantially larger than would be predicted by power law extrapolation.
I’m confused by some of the set-up here. When considering catastrophes, your “cost to save a life” represents the cost to save that life conditional on the catastrophe being due to occur? (I’m not saying “conditional on occurring” because presumably you’re allowed interventions which try to avert the catastrophe.)
Understood this way, I find this assumption very questionable, since I feel like the effect of having more opportunities to save lives in catastrophes is roughly offset by the greater difficulty of preparing to take advantage of those opportunities pre-catastrophe.
Or is the point that you’re only talking about saving lives via resilience mechanisms in catastrophes, rather than trying to make the catastrophes not happen or be small? But in that case the conclusions about existential risk mitigation would seem unwarranted.
I can’t speak for Elizabeth, but I also find that that paragraph feels off, for reasons something like:
Conflation of “counterfactual money to high-impact charities” with “your impact”
Maybe even if it’s counterfactually moved, you don’t get to count all the impact from it as your impact, since to avoid double-counting and impact ponzi schemes it’s maybe important to take a “share-of-the-pie” approach to thinking about your impact (here’s my take on that general question), and presumably they get a lot of the credit for their giving
Plus, maybe you do things which are importantly valuable that aren’t about your pledge! It’s at least a plausible reading (though it’s ambiguous) that “double your impact” would be taken as “double your lifetime impact”
As well as sharing credit for their donations with them, you maybe need to share credit for having nudged them to make the pledge with other folks (including but not limited to GWWC)
As you say, their donations may not be counterfactual even in the short-term
Even if a good fraction of them come from outside the community, the remaining fraction still reduces expected impact correspondingly
Although on average I think it’s likely very good, I’m sure in some cases the EA push towards a few charities that have been verified as highly effective actually does harm by pulling people to give to those over some other charities which were in fact even more effective (but illegibly so)
Man, long-term counterfactuals are hard
Maybe GWWC/EA ends up growing a lot further, so that it reaches effective saturation among ~all relevant audiences
In that world, if someone was open to taking the GWWC pledge, they’d likely do it eventually, even if they are currently not at all connected to the community
Now, none of these points are blatant errors, or make me want to say "what were you thinking?!?". But I feel that, taken together, the picture is that in fact there's a lot of complexity to the question of how impact should be counted in that case, and the text doesn't help the reader to understand that there's a lot of complexity or how to navigate thinking about it, but instead cheerfully presents the most favourable possible interpretation. It just has a bit of a vibe of slightly-underhand sales tactics, or something?