I share the feeling that advocates of short timelines often overestimated the reliability of their methods, and have said so here.
At the same time, when I see skeptics of AI progress say these arguments lack “evidence”, it’s unclear to me what the neutral criteria are for what counts as “evidence”. I agree that lots of the dynamics you describe exist, but they don’t seem at all unique to discussions of AI timelines to me. I think they stem from the fact that interpreting evidence is much messier than an academic, ivory-tower view of “evidence” would make it seem.
As an example, a common critique I have noted among AI skeptics is that arguments for short timelines or AI risk don’t follow traditional academic processes such as pre-publication peer review. The implication is that this suggests they lack rigor and in some sense shouldn’t count as “real” evidence. But it seems to me that by this standard peer review itself lacks evidence to support its practice. See the various reproducibility projects with underwhelming results and the replication crisis in general. Yes, things like METR’s work have limitations and don’t precisely replicate the ideal experiment, but nor do animal models or cell lines precisely replicate human disease as seen in the clinic. Benchmarks can be gamed, but the idea of “benchmarks” comes from the machine learning literature itself; it isn’t something cooked up specifically to argue for short timelines.
What are the standards that you think an argument should meet to count as “evidence-based”?
The purpose of peer-review is to make sure that the publication has no obvious errors and meets some basic standards. I have been a peer-reviewer myself, and what I have seen is that the general quality of stuff sent to computer science conferences is low. Peer-review removes the most blatantly bad papers. To a layperson who doesn’t know the field and who cannot judge the quality of studies, it is safest to stick to peer-reviewed papers.
But it has never been suggested that peer-review somehow magically separates good evidence from bad evidence. In my work, I often refer to arXiv papers that are not peer-reviewed, but which I believe are methodologically sound and present valuable contributions. On the other hand, I know that conferences and journals often publish papers even with grave methodological errors or lack of statistical understanding.
Ultimately, the real test of a study is the criticism it receives after its publication, not peer-review. If researchers in the field think that the study is good and build their research on it, it is much more credible evidence than a study that is disproved by studies that come after it. One should never rely on a single study alone.
In the case of METR’s study, their methodological errors do not preclude their conclusions from being correct. I think what they are trying to do is interesting and worthy of research. I’d love to see other researchers attempt to replicate the study while improving on the methodology; if they obtain similar results, that would provide evidence for METR’s conclusions. So far, we haven’t seen this (or at least I am not aware of it). Although even in that case, the problem of evidence mismatch remains, and we should be careful not to stretch those conclusions too far.
Ultimately, the real test of a study is the criticism it receives after its publication, not peer-review. If researchers in the field think that the study is good and build their research on it, it is much more credible evidence than a study that is disproved by studies that come after it. One should never rely on a single study alone.
This seems reasonable to me, but I don’t think it’s necessarily entirely consistent with the OP. I think a lot of the reason why AI is such a talked-about topic compared to 5 years ago is that people have seen work that has gone on in the field and are building on and reacting to it. In other words, they perceive existing results to be evidence of significant progress and opportunities. They could be overreacting to or overhyping those results, but to me it doesn’t seem fair to say that the belief in short timelines is entirely “non-evidence-based”. Things like METR’s work, scaling laws, and benchmarks are evidence, even if they aren’t necessarily strong or definitive evidence.
I think it is reasonable to disagree with the conclusions that people draw based on these things, but I don’t entirely understand the argument that these things are “non-evidence-based”. I think it is worthwhile to distinguish between a disagreement over methodology, evidence strength, or interpretation, and the case where an argument is literally completely free of any evidence or substantiation whatsoever. In my view, arguments for short timelines contain evidence, but that doesn’t mean that their conclusions are correct.
In my post, I referred to the concept of “evidence-based policy making”. In this context, evidence refers specifically to rigorous, scientific evidence, as opposed to intuitions, unsubstantiated beliefs and anecdotes. Scientific evidence, as I said, refers to high-quality studies corroborated by other studies. And, as I emphasized in my point about evidence mismatch, using a study that concludes something as evidence for something else is a fallacy.
The idea that current progress in AI can be taken as evidence for AGI, which in some sense is the most extreme progress in AI imaginable, incomparable to current progress, is an extraordinary claim that requires extraordinary evidence. People arguing for this are mostly basing their argument on their intuition and guesses, yet they often demand drastic actions based on those beliefs. We, as the EA community, should make decisions based on evidence. Currently, people are providing substantial funding to the “AI cause” based on arguments that do not meet the bar of evidence-based policy, and I think that is something that should and must be criticized.
It seems like the core of your argument is saying that there is a high burden of proof that hasn’t been met. I agree that arguments for short timelines haven’t met a high burden of proof but I don’t believe that there is such a burden. I will try to explain my reasoning, although I’m not sure if I can do the argument justice in a comment, perhaps I will try to write a post about the issue.
When it comes to policy, I think the goal should be to make good decisions. You don’t get any style points for how good your arguments or evidence are if the consequences of your decisions are bad. That doesn’t mean we shouldn’t use evidence to make decisions, we certainly should. But the reason is that using evidence will improve the quality of the decision, not for “style points” so-to-speak.
Doing nothing and sticking with the status quo is also a decision that can have important consequences. We can’t just magically have more rigorous evidence, we have to make decisions and allocate resources in order to get that evidence. Getting that evidence is itself a resource-allocation decision. When we make those decisions, we have to live with the uncertainty that we face and make the best decision given that uncertainty. If we don’t have solid scientific evidence, we still have to make some decision. It isn’t optional. Sticking with the status quo is still making a decision. If we lack scientific evidence, then that policy decision won’t be evidence-based even if we do nothing. I think we should make the best decision we can given what information we have instead of defaulting to an informal burden of proof. If there is a formal burden of proof, like a burden on one party in a court case or a procedure for how an administrative or legislative body should decide, then in my view that formal procedure establishes what the burden of proof is.
The idea that current progress in AI can be taken as evidence for AGI
Although I believe there should be policy action/changes in response to the risk from AI, I personally don’t see the case for this as hinging on the achievement of “AGI”. I’ve described my position as being more concerned about “powerful” AI than “intelligent” AI. I think focusing on “AGI” or how “intelligent” an AI system is or will be often leads to unproductive rabbit holes or definition debates. On the other hand, obviously lots of AI risk advocates do focus on AGI, so I acknowledge it is completely fair game for skeptics to critique this.
Do you think you would be more open to some types of AI policy if the case for those policies didn’t rely on the emergence of “AGI”?
But the reason is that using evidence will improve the quality of the decision, not for “style points” so-to-speak.
No one has ever claimed that evidence should be collected for “style points”.
We can’t just magically have more rigorous evidence, we have to make decisions and allocate resources in order to get that evidence.
Fortunately, AI research has plenty of funding right now (without any EA money), so in principle getting evidence should not be an issue. I am not against research; I am a proponent of it.
Doing nothing and sticking with the status quo is also a decision that can have important consequences. [...] If we lack scientific evidence, then that policy decision won’t be evidence-based even if we do nothing.
Sticking with the status quo is often the best decision. When deciding how to use funds efficiently, you have to consider the opportunity cost of using those funds on something that has a certain positive benefit. And that alternative action is evidence-based. Thus, the dichotomy between “acting on AI without evidence” and “doing nothing without evidence” is false; the options are actually “acting on AI without evidence” and “acting on another cause area with evidence”.
If the estimated value of using the money for AI is below the benefit of the alternative, we should not use it for AI and instead stick to the status quo on that matter. Most AI interventions are not tractable, and due to this their actual utility might even be negative.
Do you think you would be more open to some types of AI policy if the case for those policies didn’t rely on the emergence of “AGI”?
Yes, there are several types of AI policy I support. However, I don’t think they are important cause areas for EA.
Fortunately, AI research has plenty of funding right now (without any EA money), so in principle getting evidence should not be an issue. I am not against research; I am a proponent of it.
AI certainly has a lot of resources available, but I don’t think those resources are primarily being used to understand how AI will impact society. I think policy could push more in this direction. For example, requiring AI companies that train or are training models above a certain compute budget to undergo third-party audits of their training process and models would push towards clarifying some of these issues in my view.
Sticking with the status quo is often the best decision. When deciding how to use funds efficiently, you have to consider the opportunity cost of using those funds on something that has a certain positive benefit. And that alternative action is evidence-based. Thus, the dichotomy between “acting on AI without evidence” and “doing nothing without evidence” is false; the options are actually “acting on AI without evidence” and “acting on another cause area with evidence”.
The conclusion that cause A is preferable to cause B involves the uncertainty about both causes. Even if cause A has more rigorous evidence than cause B, that doesn’t mean the conclusion that benefits(A) > benefits(B) is similarly rigorous.
Let’s take AI and global health and development (GHD) as an example. I think it would be reasonable to say that evidence for GHD is much more rigorous and scientific than the evidence for AI. Yet that doesn’t mean that the evidence conclusively shows benefits(GHD) > benefits(AI). Let’s say that someone believes that the evidence for GHD is scientific and the evidence for AI is not (or at least much less so), but that the overall, all-things-considered best estimate of benefits(AI) is greater than the best estimate of benefits(GHD). I think many people in the EA community in fact have this view. Do you think those people should still prefer GHD because AI is off limits due to not being “scientific”? I would consider this to be “for style points”, and disagree with this approach.
I will caveat this by saying that in my opinion it makes sense for estimation purposes to discount or shrink estimates of highly uncertain quantities, which I think many advocates of AI as a cause fail to do and can be fairly criticized for. But the issue is a quantitative one, and so can come out either way. I think there is a difference between saying that we should heavily shrink estimates related to AI due to their uncertainty and lower quality evidence, vs saying that they lack any evidence whatsoever.
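To make concrete what I mean by the issue being a quantitative one, here is a minimal sketch (the numbers and the 50% haircut are hypothetical assumptions for illustration, not estimates I am defending): even after a heavy discount for uncertainty, the comparison can come out either way depending on the magnitudes involved.

```python
# Hypothetical illustration: discount the highly uncertain estimate heavily,
# then compare. The conclusion depends on the numbers, not on which cause
# has the more rigorous evidence behind it.
ghd_benefit = 1.0          # relatively well-evidenced estimate (arbitrary units)
ai_benefit_raw = 3.0       # speculative estimate before any discounting
discount = 0.5             # heavy haircut for uncertainty and weaker evidence

ai_benefit_adj = discount * ai_benefit_raw   # 1.5
print(ai_benefit_adj > ghd_benefit)          # True here; False if ai_benefit_raw were 1.5
```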
If the estimated value of using the money for AI is below the benefit of the alternative, we should not use it for AI and instead stick to the status quo on that matter.
I agree, but it doesn’t follow from one cause being “scientific” while the other isn’t that the “scientific” cause area has higher benefits.
Most AI interventions are not tractable, and due to this their actual utility might even be negative.
I actually agree that tractability is (ironically) a strongly neglected factor and many proponents of AI as a cause area ignore or vastly overestimate the tractability of AI interventions, including the very real possibility that they are counterproductive/net-negative. I still think there are worthwhile opportunities but I agree that this is an underappreciated downside of AI as a cause area.
Yes, there are several types of AI policy I support. However, I don’t think they are important cause areas for EA.
Can I ask why? Do you think AI won’t be a “big deal” in the reasonably near future?
I think many people in the EA community in fact have this view. Do you think those people should still prefer GHD because AI is off limits due to not being “scientific”? I would consider this to be “for style points”, and disagree with this approach.
It seems you have an issue with the word “scientific” and are constructing a straw-man argument around it. This has nothing to do with “style points”. As I have already explained, by scientific I only refer to high-quality studies that withstand scrutiny. If a study doesn’t, then its value as evidence is heavily discounted, as the probability of the conclusions of the study being right despite methodological errors, failures to replicate it, etc. is lower than if the study does not have these issues. If a study hasn’t been scrutinized at all, it is likely bad, because the amount of bad research is greater than the amount of good research (for example, if we look at the rejection rates of journals/conferences), and lack of scrutiny implies lack of credibility, as researchers do not take the study seriously enough to scrutinize it.
The conclusion that cause A is preferable to cause B involves the uncertainty about both causes. Even if cause A has more rigorous evidence than cause B, that doesn’t mean the conclusion that benefits(A) > benefits(B) is similarly rigorous.
Yet E[benefits(A)] > E[benefits(B)] is a rigorous conclusion, because the uncertainty can be factored into the expected value.
Can I ask why? Do you think AI won’t be a “big deal” in the reasonably near future?
The International AI Safety Report lists many realistic threats (the first one of those is deepfakes, to give an example). Studying and regulating these things is nice, but they are not effective interventions in terms of lives saved etc.
I’m really at a loss here. If your argument is taken literally, I can convince you to fund anything, since I can give you highly uncertain arguments for almost everything. I cannot believe this is really your stance. You must agree with me that uncertainty affects decision making. It only seems that the word “scientific” bothers you for some reason, which I cannot really understand either. Do you believe that methodological errors are not important? That statistical significance is not required? That replicability doesn’t matter? To object to the idea that these issues cause uncertainty is absurd.
It seems you have an issue with the word “scientific” and are constructing a straw-man argument around it.
The “scientific” phrasing frustrates me because I feel like it is often used to suggest high rigor without demonstrating that such rigor actually applies to a given situation, and because I feel like it is used to exclude certain categories of evidence when those categories are relevant, even if they are less strong compared to other kinds of evidence. I think we should weigh all relevant evidence, not exclude certain pieces because they aren’t scientific enough.
Yet E[benefits(A)] > E[benefits(B)] is a rigorous conclusion, because the uncertainty can be factored into the expected value.
Yes, but in doing so the uncertainty in both A and B matters, and showing that A is lower variance than B doesn’t show that E[benefits(A)] > E[benefits(B)]. Even if benefits(B) are highly uncertain and we know benefits(A) extremely precisely, it can still be the case that benefits(B) are larger in expectation.
I cannot believe this is really your stance. You must agree with me that uncertainty affects decision making.
In my comment that you are responding to, I say:
The conclusion that cause A is preferable to cause B involves the uncertainty about both causes.
I also say:
I will caveat this by saying that in my opinion it makes sense for estimation purposes to discount or shrink estimates of highly uncertain quantities
What about these statements makes you think that I don’t believe uncertainty affects decision making? It seems like I say that it does affect decision making in my comment.
If stock A very likely has a return in the range of 1-2%, and stock B very likely has a return in the range of 0-10%, do you think stock A must have a better expected return because it has lower uncertainty?
Yes, uncertainty matters, but it is more complicated than saying that the least uncertain option is always better. Sometimes the option that has less rigorous support is still better in an all-things-considered analysis.
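To spell out the arithmetic of the stock example (assuming, purely for illustration, a uniform distribution over each stated range):

```python
# Expected returns under the assumed uniform-over-range reading of the example.
exp_a = (1 + 2) / 2    # stock A, 1-2% range  -> 1.5% expected return
exp_b = (0 + 10) / 2   # stock B, 0-10% range -> 5.0% expected return
print(exp_a, exp_b)    # 1.5 5.0: the option with more uncertainty has the higher mean
```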
If your argument is taken literally, I can convince you to fund anything, since I can give you highly uncertain arguments for almost everything.
I don’t think my argument leads to this conclusion. I’m just saying that AI risk has some evidence behind it, even if it isn’t the most rigorous evidence! That’s why I’m being such a stickler about this! If it were true that AI risk has actually zero evidence then of course I wouldn’t buy it! But I don’t think there actually is zero evidence even if AI risk advocates sometimes overestimate the strength of the evidence.
Yes, but in doing so the uncertainty in both A and B matters, and showing that A is lower variance than B doesn’t show that E[benefits(A)] > E[benefits(B)]. Even if benefits(B) are highly uncertain and we know benefits(A) extremely precisely, it can still be the case that benefits(B) are larger in expectation.
If you properly account for uncertainty, you should pick the certain cause over the uncertain one even if a naive EV calculation says otherwise, because you aren’t accounting for the selection process involved in picking the cause. I’m writing an explainer for this, but if I’m reading the optimiser’s curse paper right, a rule of thumb is that if cause A is 10 times more certain than cause B, cause B should be downweighted by a factor of 100 when comparing them.
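If it helps, here is roughly how I understand that factor to arise, sketched under a simple normal-normal model (the prior and the numbers are illustrative assumptions on my part, not taken from the paper):

```python
# Sketch of the shrinkage adjustment usually proposed for the optimiser's curse:
# pull each noisy estimate toward a common prior, shrinking more as the
# estimate's variance grows. A 10x larger standard deviation means a 100x
# larger variance, which is where the factor-of-100 rule of thumb comes from.
def posterior_mean(estimate, noise_sd, prior_mean=0.0, prior_sd=1.0):
    w = prior_sd**2 / (prior_sd**2 + noise_sd**2)   # weight kept on the noisy estimate
    return w * estimate + (1 - w) * prior_mean

print(posterior_mean(10.0, noise_sd=2.0))    # ~2.0
print(posterior_mean(10.0, noise_sd=20.0))   # ~0.025: ten times the noise, and the
                                             # weight drops ~100x in the noise-dominated limit
```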
I will caveat this by saying that in my opinion it makes sense for estimation purposes to discount or shrink estimates of highly uncertain quantities, which I think many advocates of AI as a cause fail to do and can be fairly criticized for. But the issue is a quantitative one, and so can come out either way. I think there is a difference between saying that we should heavily shrink estimates related to AI due to their uncertainty and lower quality evidence, vs saying that they lack any evidence whatsoever.
I feel like my position is consistent with what you have said; I just view this as part of the estimation process. When I say “E[benefits(A)] > E[benefits(B)]” I am assuming these are your best all-inclusive estimates, including regularization/discounting/shrinking of highly variable quantities. In fact I think it’s also fine to use things other than expected value or in general use approaches that are more robust to outliers/high-variance causes. As I say in the above quote, I also think it is a completely reasonable criticism of AI risk advocates that they fail to do this reasonably often.
If you properly account for uncertainty, you should pick the certain cause over the uncertain one even if a naive EV calculation says otherwise
This is sometimes correct, but the math could still come out in favor of the highly uncertain cause area after adjustment. Do you agree with this? That’s really the only point I’m trying to make!
I don’t think the difference here comes down to one side which is scientific and rigorous and loves truth against another that is biased and shoddy and just wants to sneak their policies through in an underhanded manner with no consideration for evidence or science. Analyzing these things is messy, and different people interpret evidence in different ways or weigh different factors differently. To me this is normal and expected.
I’d be very interested to read your explainer, it sounds like it addresses a valid concern with arguments for AI risk that I also share.
The “scientific” phrasing frustrates me because I feel like it is often used to suggest high rigor without demonstrating that such rigor actually applies to a given situation, and because I feel like it is used to exclude certain categories of evidence when those categories are relevant, even if they are less strong compared to other kinds of evidence. I think we should weigh all relevant evidence, not exclude certain pieces because they aren’t scientific enough.
Again, you are attacking me because of the word “scientific” instead of attacking my arguments. As I have said many, many times, studies should be weighted based on their content and the scrutiny they receive. To oppose the word “science” just because of the word itself is silly. Your idea that works are arbitrarily sorted into “scientific” and “non-scientific” based on “style points” instead of assessing their merits is just wrong and a straw-man argument.
I don’t think my argument leads to this conclusion. I’m just saying that AI risk has some evidence behind it, even if it isn’t the most rigorous evidence! That’s why I’m being such a stickler about this! If it were true that AI risk has actually zero evidence then of course I wouldn’t buy it! But I don’t think there actually is zero evidence even if AI risk advocates sometimes overestimate the strength of the evidence.
Where have I ever claimed that there is no evidence worth considering? In the start of my post, I write:
What unites many of these statements is the thorough lack of any evidence.
There are some studies that are rigorously conducted that provide some meager evidence. Not really enough to justify any EA intervention. But instead of referring to these studies, people use stuff like narrative arguments and ad-hoc models, which have approximately zero evidential value. That is the point of my post.
What about these statements makes you think that I don’t believe uncertainty affects decision making? It seems like I say that it does affect decision making in my comment.
If you believe this, I don’t understand where you disagree with me, other than your weird opposition to the word “scientific”.
Where have I ever claimed that there is no evidence worth considering?
In your OP, you write:
In this post, I’ve criticized non-evidence-based arguments, which hangs on the idea that evidence is something that is inherently required. Yet it has become commonplace to claim the opposite. One example of this argument is presented in the International AI Safety Report
You then quote the following:
Given sometimes rapid and unexpected advancements, policymakers will often have to weigh potential benefits and risks of imminent AI advancements without having a large body of scientific evidence available. In doing so, they face a dilemma. On the one hand, pre-emptive risk mitigation measures based on limited evidence might turn out to be ineffective or unnecessary. On the other hand, waiting for stronger evidence of impending risk could leave society unprepared or even make mitigation impossible – for instance if sudden leaps in AI capabilities, and their associated risks, occur.
Your summary of the quoted text is inaccurate. You claim that this is an argument that evidence is not something that is inherently required, but the quote says no such thing. Instead, it references “a large body of scientific evidence” and “stronger evidence” vs “limited evidence”. This quote essentially makes the same argument I do above. How can we square the differences in these interpretations?
In response to me, you write:
In my post, I referred to the concept of “evidence-based policy making”. In this context, evidence refers specifically to rigorous, scientific evidence, as opposed to intuitions, unsubstantiated beliefs and anecdotes. Scientific evidence, as I said, refers to high-quality studies corroborated by other studies.
So, as used in your post, “evidence” means “rigorous, scientific evidence, as opposed to intuitions, unsubstantiated beliefs and anecdotes”. This is why I find your reference to “scientific evidence” frustrating. You draw a distinction between two categories of evidence and claim policy should be based on only one. I disagree; I think policy should be based on all available evidence, including intuition and anecdote (“unsubstantiated belief” obviously seems definitionally not evidence). I also think your argument relies heavily on contrasting with a hypothetical highly rigorous body of evidence that isn’t often achieved, which is why I have pointed out what I see as the “messiness” of lots of published scientific research.
The distinction you draw and how you define “evidence” result in an equivocation. Your characterization of the quote above only makes sense if you are claiming that AI risk can only claim to be “evidence-based” if it is backed by “high-quality studies that withstand scrutiny”. In other words, as I said in one of my comments:
It seems like the core of your argument is saying that there is a high burden of proof that hasn’t been met.
So, where do we disagree? As I say immediately after:
I agree that arguments for short timelines haven’t met a high burden of proof but I don’t believe that there is such a burden.
I believe that we should compare E[benefits(AI)] with E[benefits(GHD)] and any other possible alternative cause areas, with no area having any specific burden of proof. The quality of the evidence plays out in taking those expectations. Different people may disagree on the results based on their interpretations of the evidence. People might weigh different sources of evidence differently. But there is no specific burden to have “high-quality studies that withstand scrutiny”, although this obviously weighs in favor of a cause that does have those studies. I don’t think having high quality studies amounts to “style points”. What I think would amount to “style points” is if someone concluded that E[benefits(AI)] > E[benefits(GHD)] but went with GHD anyway because they think AI is off limits due to the lack of “high-quality studies that withstand scrutiny” (i.e. if there is a burden of proof where “high-quality studies that withstand scrutiny” are required).
If you believe that evidence that does not withstand scrutiny (that is, evidence that does not meet basic quality standards, contains major methodological errors, is statistically insignificant, is based on fallacious reasoning, or any other reason why the evidence is scrutinized) is evidence that we should use, then you are advocating for pseudoscience. The expected value of benefits based on such evidence is near zero.
I’m sorry if criticizing pseudoscience is frustrating, but that kind of thinking has no place in rational decision-making.
Your summary of the quoted text is inaccurate. You claim that this is an arguement that evidence is not something that in inherently required, but the quote says no such thing. Instead, it references “a large body of scientific evidence” and “stronger evidence” vs “limited evidence”. This quote essential makes the same arguement I do above. How can we square the differences in these interpretations?
The quoted text implies that the evidence would not be sufficient under normal circumstances, hence the “evidence dilemma”. If the amount of evidence was sufficient, there would be no question about what is the correct action. While the text washes its hands of making the actual decision to rely on insufficient evidence, it clearly considers this a serious possibility, which is not something that I believe anyone should advocate.
You are splitting hairs about the difference between “no evidence” and “limited evidence”. The report considers a multitude of different AI risks, some of which have more evidence and some of which have less. What is important is that they bring up the idea that policy should be made without proper evidence.
If you believe that evidence that does not withstand scrutiny (that is, evidence that does not meet basic quality standards, contains major methodological errors, is statistically insignificant, is based on fallacious reasoning, or any other reason why the evidence is scrutinized) is evidence that we should use, then you are advocating for pseudoscience. The expected value of benefits based on such evidence is near zero.
I don’t think evidence which is based on something other than “high-quality studies that withstand scrutiny” is pseudoscience. You could have moderate-quality studies that withstand scrutiny, or you could have preliminary studies which are suggestive but which haven’t been around long enough for scrutiny to percolate up. I don’t think these things have near zero evidential value.
This is my issue with your use of the term “scientific evidence” and related concepts. Its role in the argument is mostly rhetorical, having the effect of characterizing other arguments or positions as not worthy of consideration without engaging with the messy question of what value various pieces of evidence actually have. It causes confusion and results in you equivocating about what counts as “evidence”.
My view, and where we seem to disagree, is that I think there are types of evidence other than “high-quality studies that withstand scrutiny” and pseudoscience. Look, I agree that if something has basically zero evidential value we can reasonably round that off to zero. But “limited evidence” isn’t the same as near-zero evidence. I think there is a category of evidence between pseudoscience/near-zero evidence and “high-quality studies that withstand scrutiny”. When we don’t have access to the highest quality evidence, it is acceptable in my view to make policy based on the best evidence that we have, including if it is in that intermediate category. This is the same argument made in the quote from the report.
The quoted text implies that the evidence would not be sufficient under normal circumstances
This is exactly what I mean when I say this approach results in you equivocating. In your OP, you explicitly claim that this quote argues that evidence is not something that is needed. You clarify in your comments with me and in a clarification at the top of your post that only “high-quality studies that withstand scrutiny” really count as evidence as you use the term. The fact that you are using the word “evidence” in this way is causing you to misinterpret the quoted statement. The quote is saying that even if we don’t have the ideal, high-quality evidence we would like, the kind needed to be highly confident and establish a strong consensus, then in situations of uncertainty it is acceptable to make policy based on more limited or moderate evidence. I share this view and think it is reasonable and not pseudoscientific or somehow a claim that evidence of some kind isn’t required.
If the amount of evidence was sufficient, there would be no question about what is the correct action.
Uncertainty exists! You can be in a situation where the correct decision isn’t clear because the available information isn’t ideal. This is extremely common in real-world decision making. The entire point of this quote and my own comments is that when these situations arise, the reasonable thing to do is to make the best possible decision with the information you have (which might involve trying to get more information) rather than declaring some policies off the table because they don’t have the highest quality evidence supporting them. Making decisions under uncertainty means making decisions based on limited evidence sometimes.
Your argument is very similar to creationist and other pseudoscientific/conspiracy theory-style arguments.
A creationist might argue that the existence of life, humanity, and other complex phenomena is “evidence” for intelligent design. If we allow this to count as “limited” evidence (or whatever term we choose to use), it is possible to follow through a Pascal’s wager-style argument and posit that this “evidence”, even if it has high uncertainty, is enough to merit an action.
It is always possible to come up with “evidence” for any claim. In evidence-based decision making, we must set a bar for evidence. Otherwise, the word “evidence” would lose its meaning, and we’d be wasting our resources considering every piece of knowledge that exists as “evidence”.
You could have moderate-quality studies that withstand scrutiny
If the studies withstand scrutiny, then they are high-quality studies. Of course, it is possible that the study has multiple conclusions, and some of them are undermined by scrutiny and some are not, or that there are errors that do not undermine the conclusions. These studies can of course be used as evidence. I used “high-quality” as the opposite of “low-quality”, and splitting hairs about “moderate-quality” is uninteresting.
you could have preliminary studies which are suggestive but which haven’t been around long enough for scrutiny to percolate up
This is a good basis when, e.g., funding new research, as confirming and replicating recent studies is an important part of science. In this case, it doesn’t matter that much if the study’s conclusions end up being true or false, as confirming either way is valuable. Researching interesting things is good, and even bad studies are evidence that the topic is interesting. But they are not evidence that should be used for other kinds of decision-making.
The fact that you are using the word “evidence” in this way is causing you to misinterpret the quoted statement.
You are again splitting hairs about the meanings of words. The important thing is that they are advocating for making decisions without sufficient evidence, which is something I oppose. Their report is long and contains many AI risks, some of which (like deepfakes) have high-quality studies behind them, while others (like X-risks) do not. As a whole, the report “has some evidence” that there are risks associated with AI. So they talk about “limited evidence”. What is important is that they imply this “limited evidence” is not sufficient for making decisions.
But “limited evidence” isn’t the same as near-zero evidence
Splitting hairs. You can call your evidence limited evidence if you want; it won’t give your argument a free pass to be considered. If it has too much uncertainty or doesn’t withstand scrutiny, it shouldn’t be taken as evidence. Otherwise we end up in the creationist situation.
People who have radical anti-institutionalist views often take reasonable criticisms of institutions and use them to argue for their preferred radical alternative. There are many reasonable criticisms of liberal democracy; these are eagerly seized on by Marxist-Leninists, anarchists, and right-wing authoritarians to insist that their preferred political system must be better. But of course this conclusion does not necessarily follow from those criticisms, even if the criticisms are sound. The task for the challenger is to support the claim that their preferred system is robustly superior, not simply that liberal democracy is flawed.
The same is true for radical anti-institutionalist views on institutional science (which the LessWrong community often espouses, or at least whenever it suits them). Pointing out legitimate failures in institutional science does not necessarily support the radical anti-institutionalists’ conclusion that peer-reviewed journals, universities, and government science agencies should be abandoned in favour of blogs, forums, tweets, and self-published reports or pre-prints. On what basis can the anti-institutionalists claim that this is a robustly superior alternative and not a vastly inferior one?
To be clear, I interpret you as making a moderate anti-institutionalist argument, not a radical one. But the problem with the reasoning is the same in either case — which is why I’m using the radical arguments for illustration. The guardrails in academic publishing sometimes fail, as in the case of research misconduct or in well-intentioned, earnestly conducted research that doesn’t replicate, as you mentioned. But is this an argument for kicking down all guardrails? Shouldn’t it be the opposite? Doesn’t this just show us that deeply flawed research can slip under the radar? Shouldn’t this underscore the importance of savvy experts doing close, critical readings of research to find flaws? Shouldn’t the replication crisis remind us of the importance of replication (which has always been a cornerstone of institutional science)? Why should the replication crisis be taken as license to give up on institutions and processes that attempt to enforce academic rigour, including replication?
In the case of both AI 2027 and the METR graph, half of the problem is the underlying substance — the methodology, the modelling choices, the data. The other half of the problem is the presentation. Both have been used to make bold, sweeping, confident claims. Academic journals referee both the substance and the presentation of submitted research; they push back on authors trying to use their data or modelling to make conclusions that are insufficiently supported.
In this vein, one of the strongest critiques of AI 2027 is that it is an exercise in judgmental forecasting, in which the authors make intuitive, subjective guesses about the future trajectory of AI research and technology development. There’s nothing inherently wrong with a judgmental forecasting exercise, but I don’t think the presentation of AI 2027 makes it clear enough that AI 2027 is nothing more than that. (80,000 Hours’ video on AI 2027, which is 34 minutes long and was carefully written and produced at a cost of $160,000, doesn’t even mention this.)
If AI 2027 had been submitted to a reputable peer-reviewed journal, besides hopefully catching the modelling errors, the reviewers probably would have insisted the authors make it clear from the outset what data the conclusions are based on (i.e. the authors’ judgmental forecasts) and where that data came from. They would probably also have insisted the conclusions are appropriately moderated and caveated in light of that. But, overall, I think AI 2027 would probably just be unpublishable.
I don’t think my argument is even that anti-institutionalist. I have issues with how academic publishing works but I still think peer reviewed research is an extremely important and valuable source of information. I just think it has flaws and is much messier than discussions around the topic sometimes make it seem.
My point isn’t to say that we should throw out traditional academic institutions; it is to say that I feel like the claim that the arguments for short timelines are “non-evidence-based” is critiquing the same messiness that is also present in peer reviewed research. If I read a study whose conclusions I disagree with, I think it would be wrong to say “field X has a replication crisis, therefore we can’t really consider this study to be evidence”. I feel like a similar thing is going on when people say the arguments for short timelines are “non-evidence-based”. To me things like METR’s work definitely are evidence, even if they aren’t necessarily strong or definitive evidence or if that evidence is open to contested interpretations. I don’t think something needs to be peer reviewed to count as “evidence”, is essentially the point I was trying to make.
Generally, the scientific community is not going around arguing that drastic measures should be taken based on singular novel studies. Mainly, what a single novel study will produce is a wave of new studies on the same subject, to ensure that the results are valid and that the assumptions used hold up to scrutiny. Hence that claimed room-temperature superconductor was so quickly debunked.
I do not see similar efforts in the AI safety community. The studies by METR are great first forays into difficult subjects, but then I see barely any scrutiny or follow-up by other researchers. And people accept much worse scholarship like AI 2027 at face value for seemingly no reason.
I have experience in both academia and EA now, and I believe that the scholarship and skeptical standards in EA are substantially worse.
I agree. EA has a cost-effectiveness problem that conflicts with its truth-seeking attempts. EA’s main driving force is cost-effectiveness, above all else—even above truth itself.
EA is highly incentivised to create and spread apocalyptic doom narratives. This is because apocalyptic doom narratives are good at recruiting people to EA’s “let’s work to decrease the probability of apocalyptic doom (because that has lots of expected value given future population projections)” cause area. And funding-wise, EA community funding (at least in the UK) is pretty much entirely about trying to make more people work in these areas.
EA is also populated by the kinds of people who respond to apocalyptic doom narratives, for the basic reason that if they didn’t they wouldn’t have ended up in EA. So stuff that promotes these narratives does well in EA’s attention economy.
EA just doesn’t have anywhere near as much £$€ to spend as academia does. It’s also very interested in doing stuff and willing to tolerate errors as long as the stuff gets done. Therefore, its academic standards are far lower.
I really don’t know how you’d fix this. I don’t think research into catastrophic risks should be conducted on a shoestring budget and by a pseudoreligion/citizen science community. I think it should be government funded and probably sit within the wider defense and security portfolio.
However I’ll give EA some grace for essentially being a citizen science community, for the same reason I don’t waste effort grumping about the statistical errors made by participants in the Big Garden Birdwatch.
Generally, the scientific community is not going around arguing that drastic measures should be taken based on singular novel studies. Mainly, what a single novel study will produce is a wave of new studies on the same subject, to ensure that the results are valid and that the assumptions used hold up to scrutiny. Hence that claimed room-temperature superconductor was so quickly debunked.
I agree that on average the scientific community does a great job of this, but I think the process is much, much messier in practice than a general description of the process makes it seem. For example, you have the Alzheimer’s research that got huge pick-up and massive funding from major scientific institutions even though the original research included doctored images. You have power-posing getting viral attention in science-adjacent media. You have priming, where Kahneman wrote in his book that even if it seems wild you have to believe in it, largely for reasons similar to what is being suggested here I think (multiple rigorous scientific studies demonstrated the phenomenon), and yet when the replication crisis came around, priming looked a lot shakier than it seemed when Kahneman wrote that.
None of this means that we should throw out the existing scientific community or declare that most published research is false (although ironically there is a peer reviewed publication with this title!). Instead, my argument is that we should understand that this process is often messy and complicated. Imperfect research still has value and, in my view, still counts as “evidence”.
The research and arguments around AI risk are not anywhere near as rigorous as a lot of scientific research (and I linked a comment above where I myself criticize AI risk advocates for overestimating the rigor of their arguments). At the same time, this doesn’t mean that these arguments do not contain any evidence or value. There is a huge amount of uncertainty about what will happen with AI. People worried about the risks from AI are trying to muddle through these issues, just like the scientific community has to muddle through figuring things out as well. I think it is completely valid to point out flaws in arguments, lack of rigor, or overconfidence (as I have also done). But evidence or an argument doesn’t have to appear in a journal or conference to count as “evidence”.
My view is that we have to live with the uncertainty and make decisions based on the information we have, while also trying to get better information. Doing nothing and going with the status quo is itself a decision that can have important consequences. We should use the best evidence we have to make the best decision given uncertainty, not just default to the status quo when we lack ideal, rigorous evidence.
I share the feeling that advocates of short timelines often overestimated the reliability of their methods, and have said so here.
At the same time, when I see skeptics of AI progress talk about these arguments lack “evidence” its unclear to me what the neutral criteria are for what counts as “evidence”. I agee that lots of the dynamics you describe exist, but they don’t seem at all unique to discussions of AI timelines to me. I think they stem from the fact that interpreting evidence is much messier than an academic, ivory tower view of “evidence” would make it seem.
As an example, a common critique I have noted among AI skeptics is that arguments for short timelines or AI risk don’t follow traditional academic proccess such as pre-publication peer review. The implication is that this suggests they lack rigor and in some sense shouldn’t count as “real” evidence. But it seems to me that by this standard peer review itself lacks evidence it support its practice. See the various reproducibility projects with underwhelming results and the replication crisis in general. Yes things like METR’s work have limitations and don’t precisely replicate the ideal experiment, but nor do animal models or cell lines precisely replicate human disease as seen in the clinic. Benchmarks can be gamed, but the idea of “benchmarks” comes from the machine learning literature itself, it isn’t something cooked up specifically to argue for short timelines.
What are the standards that you think an argument should meet to count as “evidence-based”?
The purpose of peer-review is to make sure that the publication has no obvious errors and meets some basic standards of publication. I have been a peer-reviewer myself, and what I have seen is that the general quality of stuff sent to computer science conferences is low. Peer-review removes the most blatantly bad papers. To a layperson who doesn’t know the field and who cannot judge the quality of studies, it is safest to stick to peer-reviewed papers.
But it has never been suggested that peer-review somehow magically separates good evidence from bad evidence. In my work, I often refer to arXiv papers that are not peer-reviewed, but which I believe are methodologically sound and present valuable contributions. On the other hand, I know that conferences and journals often publish papers even with grave methodological errors or lack of statistical understanding.
Ultimately, the real test of a study is the criticism it receives after its publication, not peer-review. If researchers in the field think that the study is good and build their research on it, it is much more credible evidence than a study that is disproved by studies that come after it. One should never rely on a single study alone.
In case of METR’s study, their methodological errors do not preclude that their conclusions are correct. I think what they are trying to do is interesting and worth of research. I’d love to see other researchers attempt to replicate the study while improving on methodology, and if they succeed in having similar results, providing evidence for METR’s conclusions. So far, we haven’t seen this (or at least I am not aware of). Although even in that case, the problem of evidence mismatch stays, and we should be careful not to draw those conclusions to far.
This seems reasonable to me, but I don’t think its necesarily entirely consistent with the OP. I think a lot of the reason why AI is such a talked about topic compared to 5 years ago is that people have seen work that has gone on in the field and are building on and reacting to it. In other words, they perceive existing results to be evidence of significant progress and opportunities. They could be overreaching to or overhyping those results, but to me it doesn’t seem fair to say that the belief in short timelines is entirely “non-evidence-based”. Things like METR’s work, scaling laws, benchmarks, these are evidence even if they aren’t necesarily strong or definitive evidence.
I think it is reasonable to disagree with the conclusions that people draw based on these things, but I don’t entirely understand the argument that these things are “non-evidence-based”. I think it is worthwhile to distinquish between a disagreement over methodology, evidence strength, or interpretation, and the case where an argument is literally completely free of any evidence or substantiation whatsoever. In my view, arguments for short timelines contain evidence, but that doesn’t mean that their conclusions are correct.
In my post, I referred to the concept of “evidence-based policy making”. In this context, evidence refers specifically to rigorous, scientific evidence, as opposed to intuitions, unsubstantiated beliefs and anecdotes. Scientific evidence, as I said, referring to high-quality studies corroborated by other studies. And, as I emphasize the point of evidence mismatch, using a study that concludes something as evidence for something else is a fallacy.
The idea that current progress in AI can be taken as evidence for AGI, which in some sense is the most extreme progress in AI imaginable, incomparable to current progress, is an extraordinary claim that requires extraordinary evidence. People arguing for this are mostly basing their argument on their intuition and guesses, yet they often demand drastic actions over their beliefs. We, as the EA community, should make decisions based on evidence. Currently, people are providing substantial funding to the “AI cause” based on arguments that do not meet the bar of evidence-based policy, and I think that is something that should and must be criticized.
It seems like the core of your argument is saying that there is a high burden of proof that hasn’t been met. I agree that arguments for short timelines haven’t met a high burden of proof but I don’t believe that there is such a burden. I will try to explain my reasoning, although I’m not sure if I can do the argument justice in a comment, perhaps I will try to write a post about the issue.
When it comes to policy, I think the goal should be to make good decisions. You don’t get any style points for how good your arguments or evidence are if the consequences of your decisions are bad. That doesn’t mean we shouldn’t use evidence to make decisions, we certainty should. But the reason is that using evidence will improve the quality of the decision, not for “style points” so-to-speak.
Doing nothing and sticking with the status quo is also a decision that can have important consequences. We can’t just magically have more rigorous evidence, we have to make decisions and allocate resources in order to get that evidence. That also requires making decisions about the allocation of resources. When we make those decisions, we have to live with the uncetainty that we face, and make the best decision given that uncertainty. If we don’t have solid scientific evidence, we still have to make some decision. It isn’t optional. Sticking with the status quo is still making a decision. If we lack scientific evidence, then that policy decision won’t be evidence-based even if we do nothing. I think we should make the best decision we can given what information we have instead of defaulting to an informal burden of proof. If there is a formal burden of proof, like a burden on one party in a court case or a procedure for how an administrative or legislative body should decide, then in my view that formal procedure establishes what the burden of proof is.
Although I believe there should be policy action/changes in response to the risk from AI, I personally don’t see the case for this as hinging on the achievement of “AGI”. I’ve described my position as being more concerned about “powerful” AI than “intelligent” AI. I think focusing on “AGI” or how “intelligent” an AI system is or will be often leads to unproductive rabbit holes or definition debates. On the other hand, obviously lots of AI risk advocates do focus on AGI, so I acknowledge it is completely fair game for skeptics to critique this.
Do you think you would be more open to some types of AI policy if the case for those policies didn’t rely on the emergence of “AGI”?
No one has ever claimed that evidence should be collected for “style points”.
Fortunately, AI research has a plenty of funding right now (without any EA money), so in principle getting evidence should not be an issue. I am not against research, I am a proponent of it.
Sticking with status quo is often the best decision. When deciding how to use funds efficiently, you have to consider the opportunity cost of using those funds to something that has a certain positive benefit. And that alternative action is evidence-based. Thus, the dichotomy between “acting on AI without evidence” and “doing nothing without evidence” is false, the options are actually “acting on AI without evidence” and “acting on another cause area with evidence”.
If the estimated value of using the money for AI is below the benefit of the alternative, we should not use it for AI and instead stick to the status quo on that matter. Most AI interventions are not tractable, and due to this their actual utility might even be negative.
Yes, there are several types of AI policy I support. However, I don’t think they are important cause areas for EA.
AI certainly has a lot of resources available, but I don’t think those resources are primarily being used to understand how AI will impact society. I think policy could push more in this direction. For example, requiring AI companies who train/are training models above a certain compute budget to undergo third-party audits of their training process and models would push towards clarifying some of these issues in my view.
The conclusion that cause A is preferable to cause B involves the uncertainty about both causes. Even if cause A has more rigorous evidence than cause B, that doesn’t mean the conclusion that benefits(A) > benefits(B) is similarly rigorous.
Lets take AI and global health and development (GHD) as an example. I think it would be reasonable to say that evidence for GHD is much more rigorous and scientific than the evidence for AI. Yet that doesn’t mean that the evidence conclusively shows benefits(GHD) > benefits(AI). Lets say that someone believes that the evidence for GHD is scientific and the evidence for AI is not (or at least much less so), but that the overall, all-things-considered best estimate of benefits(AI) are greater than the best estimate of benefits(GHD). I think many people in the EA community in fact have this view. Do you think those people should still prefer GHD because AI is off limits due to not being “scientific”? I would consider this to be “for style points”, and disagree with this approach.
I will caveat this by saying that in my opinion it makes sense for estimation purposes to discount or shrink estimates of highly uncertain quantities, which I think many advocates of AI as a cause fail to do and can be fairly criticized for. But the issue is a quantitative one, and so it can come out either way. I think there is a difference between saying that we should heavily shrink estimates related to AI due to their uncertainty and lower-quality evidence, versus saying that they lack any evidence whatsoever.
I agree, but the fact that one cause is “scientific” while the other isn’t doesn’t mean that the “scientific” cause area has higher benefits.
I actually agree that tractability is (ironically) a strongly neglected factor and many proponents of AI as a cause area ignore or vastly overestimate the tractability of AI interventions, including the very real possibility that they are counterproductive/net-negative. I still think there are worthwhile opportunities but I agree that this is an underappreciated downside of AI as a cause area.
Can I ask why? Do you think AI won’t be a “big deal” in the reasonably near future?
It seems you have an issue with the word “scientific” and are constructing a straw-man argument around it. This has nothing to do with “style points”. As I have already explained, by scientific I only refer to high-quality studies that withstand scrutiny. If a study doesn’t, then its value as evidence is heavily discounted, as the probability of its conclusions being right despite methodological errors, failures to replicate, etc. is lower than if the study did not have these issues. If a study hasn’t been scrutinized at all, it is likely bad, because the amount of bad research is greater than the amount of good research (consider, for example, the rejection rates of journals and conferences), and lack of scrutiny implies lack of credibility, as researchers do not take the study seriously enough to scrutinize it.
Yet E[benefits(A)] > E[benefits(B)] is a rigorous conclusion, because the uncertainty can be factored into the expected value.
The International AI Safety Report lists many realistic threats (the first of which is deepfakes, to give an example). Studying and regulating these things is nice, but they are not effective interventions in terms of lives saved etc.
I’m really at a loss here. If your argument is taken literally, I can convince you to fund anything, since I can give you highly uncertain arguments for almost everything. I cannot believe this is really your stance. You must agree with me that uncertainty affects decision making. It only seems that the word “scientific” bothers you for some reason, which I cannot really understand either. Do you believe that methodological errors are not important? That statistical significance is not required? That replicability does not matter? To object to the idea that these issues cause uncertainty is absurd.
The “scientific” phrasing frustrates me because I feel like it is often used to suggest high rigor without demonstrating that such rigor actually applies to a given situation, and because I feel like it is used to exclude certain categories of evidence when those categories are relevant, even if they are less strong compared to other kinds of evidence. I think we should weigh all relevant evidence, not exclude certain pieces because they aren’t scientific enough.
Yes, but in doing so the uncertainty in both A and B matters, and showing that A is lower variance than B doesn’t show that E[benefits(A)] > E[benefits(B)]. Even if benefits(B) are highly uncertain and we know benefits(A) extremely precisely, it can still be the case that benefits(B) are larger in expectation.
In my comment that you are responding to, I say:
I also say:
What about these statements makes you think that I don’t believe uncertainty affects decision making? It seems like I say that it does affect decision making in my comment.
If stock A very likely has a return in the range of 1-2%, and stock B very likely has a return in the range of 0-10%, do you think stock A must have a better expected return because it has lower uncertainty?
Yes, uncertainty matters, but it is more complicated than saying that the least uncertain option is always better. Sometimes the option that has less rigorous support is still better in an all-things-considered analysis.
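To make the stock example above concrete, here is a minimal sketch in Python (the uniform return distributions and the specific numbers are just my assumptions for illustration, not anything anyone in this thread has claimed):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Stock A: returns very likely in a narrow 1-2% band (low uncertainty).
returns_a = rng.uniform(0.01, 0.02, n)
# Stock B: returns anywhere in a wide 0-10% band (high uncertainty).
returns_b = rng.uniform(0.00, 0.10, n)

print(f"E[return A] ~ {returns_a.mean():.2%}, spread {returns_a.std():.2%}")
print(f"E[return B] ~ {returns_b.mean():.2%}, spread {returns_b.std():.2%}")
# B is far more uncertain, yet its expected return (~5%) beats A's (~1.5%).
```

Lower variance on its own tells you nothing about which expectation is larger; that comparison has to be made directly.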
I don’t think my argument leads to this conclusion. I’m just saying that AI risk has some evidence behind it, even if it isn’t the most rigorous evidence! That’s why I’m being such a stickler about this! If it were true that AI risk has actually zero evidence then of course I wouldn’t buy it! But I don’t think there actually is zero evidence even if AI risk advocates sometimes overestimate the strength of the evidence.
If you properly account for uncertainty, you should pick the certain cause over the uncertain one even if a naive EV calculation says otherwise, because you aren’t accounting for the selection process involved in picking the cause. I’m writing an explainer for this, but if I’m reading the optimiser’s curse paper right, a rule of thumb is that if cause A is 10 times more certain than cause B, cause B should be downweighted by a factor of 100 when comparing them.
In one of my comments above, I say this:
I feel like my position is consistent with what you have said, I just view this as part of the estimation process. When I say “E[benefits(A)] > E[benefits(B)]” I am assuming these are your best all-inclusive estimates, including regularization/discounting/shrinking of highly variable quantities. In fact I think it’s also fine to use things other than expected value, or in general to use approaches that are more robust to outliers/high-variance causes. As I say in the above quote, I also think it is a completely reasonable criticism of AI risk advocates that they fail to do this reasonably often.
This is sometimes correct, but the math could come out that the highly uncertain cause area is preferable after adjustment. Do you agree with this? That’s really the only point I’m trying to make!
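To illustrate that quantitative point, here is a toy sketch of the kind of shrinkage adjustment being discussed, using a simple normal-normal model. This is my own illustration rather than the calculation from the optimiser’s curse paper, and every number in it is made up:

```python
def shrunk_estimate(estimate, noise_sd, prior_mean=0.0, prior_sd=1.0):
    """Posterior mean under a normal prior and normally distributed
    estimation noise: noisier estimates get pulled harder toward the prior."""
    w = prior_sd**2 / (prior_sd**2 + noise_sd**2)  # weight on the raw estimate
    return w * estimate + (1 - w) * prior_mean

# Cause A: modest raw estimate, measured fairly precisely.
# Cause B: much larger raw estimate, measured ten times as noisily.
adjusted_a = shrunk_estimate(estimate=1.0, noise_sd=0.3)   # ~0.92
adjusted_b = shrunk_estimate(estimate=30.0, noise_sd=3.0)  # ~3.00

print(adjusted_a, adjusted_b)
# B keeps only ~10% of its raw estimate while A keeps ~92%, yet B still
# comes out ahead here. With a raw estimate of, say, 5 instead of 30,
# B would lose after adjustment. The correction matters, but it can go
# either way depending on the numbers.
```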
I don’t think the difference here comes down to one side which is scientific and rigorous and loves truth against another that is biased and shoddy and just wants to sneak their policies through in an underhanded manner with no consideration for evidence or science. Analyzing these things is messy, and different people interpret evidence in different ways or weigh different factors differently. To me this is normal and expected.
I’d be very interested to read your explainer, it sounds like it addresses a valid concern with arguments for AI risk that I also share.
Again, you are attacking me because of the word “scientific” instead of attacking my arguments. As I have said many, many times, studies should be weighted based on their content and the scrutiny they receive. To oppose the word “science” just because of the word itself is silly. Your idea that works are arbitrarily sorted into “scientific” and “non-scientific” based on “style points” instead of assessing their merits is just wrong and a straw-man argument.
Where have I ever claimed that there is no evidence worth considering? At the start of my post, I write:
There are some studies that are rigorously conducted that provide some meager evidence. Not really enough to justify any EA intervention. But instead of referring to these studies, people use stuff like narrative arguments and ad-hoc models, which have approximately zero evidential value. That is the point of my post.
If you believe this, I don’t understand where you disagree with me, other than your weird opposition to the word “scientific”.
In your OP, you write:
You then quote the following:
Your summary of the quoted text is inaccurate. You claim that this is an argument that evidence is not something that is inherently required, but the quote says no such thing. Instead, it references “a large body of scientific evidence” and “stronger evidence” vs “limited evidence”. This quote essentially makes the same argument I do above. How can we square the differences in these interpretations?
In response to me, you write:
You have also added as a clarification to your OP:
So, as used in your post, “evidence” means “rigorous, scientific evidence, as opposed to intuitions, unsubstantiated beliefs and anecdotes”. This is why I find your reference to “scientific evidence” frustrating. You draw a distinction between two categories of evidence and claim policy should be based on only one. I disagree: I think policy should be based on all available evidence, including intuition and anecdote (“unsubstantiated belief” obviously seems definitionally not evidence). I also think your argument relies heavily on contrasting with a hypothetical, highly rigorous body of evidence that isn’t often achieved, which is why I have pointed out what I see as the “messiness” of lots of published scientific research.
The distinction you draw and the way you have defined “evidence” result in an equivocation. Your characterization of the quote above only makes sense if you are claiming that AI risk can only claim to be “evidence-based” if it is backed by “high-quality studies that withstand scrutiny”. In other words, as I said in one of my comments:
So, where do we disagree? As I say immediately after:
I believe that we should compare E[benefits(AI)] with E[benefits(GHD)] and any other possible alternative cause areas, with no area having any specific burden of proof. The quality of the evidence plays out in taking those expectations. Different people may disagree on the results based on their interpretations of the evidence. People might weigh different sources of evidence differently. But there is no specific burden to have “high-quality studies that withstand scrutiny”, although this obviously weighs in favor of a cause that does have those studies. I don’t think having high quality studies amounts to “style points”. What I think would amount to “style points” is if someone concluded that E[benefits(AI)] > E[benefits(GHD)] but went with GHD anyway because they think AI is off limits due to the lack of “high-quality studies that withstand scrutiny” (i.e. if there is a burden of proof where “high-quality studies that withstand scrutiny” are required).
If you believe that evidence that does not withstand scrutiny (that is, evidence that does not meet basic quality standards, that contains major methodological errors, that is statistically insignificant, that is based on fallacious reasoning, or that fails scrutiny for any other reason) is evidence that we should use, then you are advocating for pseudoscience. The expected value of benefits based on such evidence is near zero.
I’m sorry if criticizing pseudoscience is frustrating, but that kind of thinking has no place in rational decision-making.
The quoted text implies that the evidence would not be sufficient under normal circumstances, hence the “evidence dilemma”. If the amount of evidence were sufficient, there would be no question about what the correct action is. While the text washes its hands of making the actual decision to rely on insufficient evidence, it clearly considers this a serious possibility, which is not something that I believe anyone should advocate.
You are splitting hairs about the difference between “no evidence” and “limited evidence”. The report considers a multitude of different AI risks, some of which have more evidence and some of which have less. What is important is that they bring up the idea that policy should be made without proper evidence.
I don’t think evidence which is based on something other than “high-quality studies that withstand scrutiny” is pseudoscience. You could have moderate-quality studies that withstand scrutiny, or preliminary studies which are suggestive but which haven’t been around long enough for scrutiny to percolate up. I don’t think these things have near zero evidential value.
This is my issue with your use of the term “scientific evidence” and related concepts. Its role in the argument is mostly rhetorical, having the effect of characterizing other arguments or positions as not worthy of consideration without engaging with the messy question of what value various pieces of evidence actually have. It causes confusion and results in you equivocating about what counts as “evidence”.
My view, and where we seem to disagree, is that I think there are types of evidence other than “high-quality studies that withstand scrutiny” and pseudoscience. Look, I agree that if something has basically zero evidential value we can reasonably round that off to zero. But “limited evidence” isn’t the same as near-zero evidence. I think there is a category of evidence between pseudoscience/near-zero evidence and “high-quality studies that withstand scrutiny”. When we don’t have access to the highest quality evidence, it is acceptable in my view to make policy based on the best evidence that we have, including if it is in that intermediate category. This is the same argument made in the quote from the report.
This is exactly what I mean when I say this approach results in you equivocating. In your OP, you explicitly claim that this quote argues that evidence is not something that is needed. You clarify in your comments with me and in a clarification at the top of your post that only “high-quality studies that withstand scrutiny” really count as evidence as you use the term. The fact that you are using the word “evidence” in this way is causing you to misinterpret the quoted statement. The quote is saying that, even if we don’t have the ideal, high-quality evidence that we would like and that might be needed for us to be highly confident and establish a strong consensus, in situations of uncertainty it is acceptable to make policy based on more limited or moderate evidence. I share this view and think it is reasonable and not pseudoscientific or somehow a claim that evidence of some kind isn’t required.
Uncertainty exists! You can be in a situation where the correct decision isn’t clear because the available information isn’t ideal. This is extremely common in real-world decision making. The entire point of this quote and my own comments is that when these situations arise the reasonable thing to do is to make the best possible decision with the information you have (which might involve trying to get more information) rather than declaring some policies off the table because they don’t have the highest quality evidence supporting them. Making decisions under uncertainty means making decisions based on limited evidence sometimes.
Your argument is very similar to creationist and other pseudoscientific/conspiracy theory-style arguments.
A creationist might argue that the existence of life, humanity, and other complex phenomena is “evidence” for intelligent design. If we allow this to count as “limited” evidence (or whatever term we choose to use), it is possible to follow through a Pascal’s wager-style argument and posit that this “evidence”, even if it has high uncertainty, is enough to merit an action.
It is always possible to come up with “evidence” for any claim. In evidence-based decision making, we must set a bar for evidence. Otherwise, the word “evidence” would lose its meaning, and we’d be wasting our resources considering every piece of knowledge that exists as “evidence”.
If the studies withstand scrutiny, then they are high-quality studies. Of course, it is possible that a study has multiple conclusions, some of which are undermined by scrutiny and some of which are not, or that there are errors that do not undermine the conclusions. These studies can of course be used as evidence. I used “high-quality” as the opposite of “low-quality”, and splitting hairs about “moderate-quality” is uninteresting.
This is a good basis when, e.g., funding new research, as confirming and replicating recent studies is an important part of science. In this case, it doesn’t matter that much if the study’s conclusions end up being true or false, as confirming either way is valuable. Researching interesting things is good, and even bad studies are evidence that the topic is interesting. But they are not evidence that should be used for other kinds of decision-making.
You are again splitting hairs about the meanings of words. The important thing is that they are advocating for making decisions without sufficient evidence, which is something I oppose. Their report is long and contains many AI risks, some of which (like deepfakes) have high-quality studies behind them, while others (like X-risks) do not. As a whole, the report “has some evidence” that there are risks associated with AI. So they talk about “limited evidence”. What is important is that they imply this “limited evidence” is not sufficient for making decisions.
Splitting hairs. You can call your evidence “limited evidence” if you want; it won’t earn your argument a free pass. If it has too much uncertainty or doesn’t withstand scrutiny, it shouldn’t be taken in as evidence. Otherwise we end up in the creationist situation.
People who have radical anti-institutionalist views often take reasonable criticisms of institutions and use them to argue for their preferred radical alternative. There are many reasonable criticisms of liberal democracy; these are eagerly seized on by Marxist-Leninists, anarchists, and right-wing authoritarians to insist that their preferred political system must be better. But of course this conclusion does not necessarily follow from those criticisms, even if the criticisms are sound. The task for the challenger is to support the claim that their preferred system is robustly superior, not simply that liberal democracy is flawed.
The same is true for radical anti-institutionalist views on institutional science (which the LessWrong community often espouses, or at least whenever it suits them). Pointing out legitimate failures in institutional science does not necessarily support the radical anti-institutionalists’ conclusion that peer-reviewed journals, universities, and government science agencies should be abandoned in favour of blogs, forums, tweets, and self-published reports or pre-prints. On what basis can the anti-institutionalists claim that this is a robustly superior alternative and not a vastly inferior one?
To be clear, I interpret you as making a moderate anti-institutionalist argument, not a radical one. But the problem with the reasoning is the same in either case — which is why I’m using the radical arguments for illustration. The guardrails in academic publishing sometimes fail, as in the case of research misconduct or in well-intentioned, earnestly conducted research that doesn’t replicate, as you mentioned. But is this an argument for kicking down all guardrails? Shouldn’t it be the opposite? Doesn’t this just show us that deeply flawed research can slip under the radar? Shouldn’t this underscore the importance of savvy experts doing close, critical readings of research to find flaws? Shouldn’t the replication crisis remind us of the importance of replication (which has always been a cornerstone of institutional science)? Why should the replication crisis be taken as license to give up on institutions and processes that attempt to enforce academic rigour, including replication?
In the case of both AI 2027 and the METR graph, half of the problem is the underlying substance — the methodology, the modelling choices, the data. The other half of the problem is the presentation. Both have been used to make bold, sweeping, confident claims. Academic journals referee both the substance and the presentation of submitted research; they push back on authors trying to use their data or modelling to make conclusions that are insufficiently supported.
In this vein, one of the strongest critiques of AI 2027 is that it is an exercise in judgmental forecasting, in which the authors make intuitive, subjective guesses about the future trajectory of AI research and technology development. There’s nothing inherently wrong with a judgmental forecasting exercise, but I don’t think the presentation of AI 2027 makes it clear enough that AI 2027 is nothing more than that. (80,000 Hours’ video on AI 2027, which is 34 minutes long and was carefully written and produced at a cost of $160,000, doesn’t even mention this.)
If AI 2027 had been submitted to a reputable peer-reviewed journal, besides hopefully catching the modelling errors, the reviewers probably would have insisted the authors make it clear from the outset what data the conclusions are based on (i.e. the authors’ judgmental forecasts) and where that data came from. They would probably also have insisted the conclusions are appropriately moderated and caveated in light of that. But, overall, I think AI 2027 would probably just be unpublishable.
I don’t think my argument is even that anti-institutionalist. I have issues with how academic publishing works but I still think peer reviewed research is an extremely important and valuable source of information. I just think it has flaws and is much messier than discussions around the topic sometimes make it seem.
My point isn’t to say that we should throw out traditional academic institutions; it is to say that I feel like the claim that the arguments for short timelines are “non-evidence-based” is critiquing the same messiness that is also present in peer-reviewed research. If I read a study whose conclusions I disagree with, I think it would be wrong to say “field X has a replication crisis, therefore we can’t really consider this study to be evidence”. I feel like a similar thing is going on when people say the arguments for short timelines are “non-evidence-based”. To me, things like METR’s work definitely are evidence, even if they aren’t necessarily strong or definitive evidence, or if that evidence is open to contested interpretations. I don’t think something needs to be peer reviewed to count as “evidence”; that is essentially the point I was trying to make.
Generally, the scientific community is not going around arguing that drastic measures should be taken based on singular novel studies. Mainly, what a single novel study will produce is a wave of new studies on the same subject, to ensure that the results are valid and that the assumptions used hold up to scrutiny. Hence why that room-temperature superconductor was so quickly debunked.
I do not see similar efforts in the AI safety community. The studies by METR are great first forays into difficult subjects, but then I see barely any scrutiny or follow-up by other researchers. And people accept much worse scholarship, like AI 2027, at face value for seemingly no reason.
I have experience in both academia and EA now, and I believe that the scholarship and skeptical standards in EA are substantially worse.
I agree. EA has a cost-effectiveness problem that conflicts with its truth-seeking attempts. EA’s main driving force is cost-effectiveness, above all else—even above truth itself.
EA is highly incentivised to create and spread apocalyptic doom narratives. This is because apocalyptic doom narratives are good at recruiting people to EA’s “let’s work to decrease the probability of apocalyptic doom (because that has lots of expected value given future population projections)” cause area. And funding-wise, EA community funding (at least in the UK) is pretty much entirely about trying to make more people work in these areas.
EA is also populated by the kinds of people who respond to apocalyptic doom narratives, for the basic reason that if they didn’t they wouldn’t have ended up in EA. So stuff that promotes these narratives does well in EA’s attention economy.
EA just doesn’t have anywhere near as much £$€ to spend as academia does. It’s also very interested in doing stuff and willing to tolerate errors as long as the stuff gets done. Therefore, its academic standards are far lower.
I really don’t know how you’d fix this. I don’t think research into catastrophic risks should be conducted on a shoestring budget and by a pseudoreligion/citizen science community. I think it should be government funded and probably sit within the wider defense and security portfolio.
However I’ll give EA some grace for essentially being a citizen science community, for the same reason I don’t waste effort grumping about the statistical errors made by participants in the Big Garden Birdwatch.
I agree that on average the scientific community does a great job of this, but I think the process is much, much messier in practice than a general description of it makes it seem. For example, you have the Alzheimer’s research that got huge pick-up and massive funding from major scientific institutions where the original research included doctored images. You have power posing getting viral attention in science-adjacent media. You have priming, where Kahneman wrote in his book that, even if it seems wild, you have to believe in it, largely for reasons similar to what is being suggested here, namely that multiple rigorous scientific studies demonstrate the phenomenon; and yet when the replication crisis came around, priming looked a lot shakier than it seemed when Kahneman wrote that.
None of this means that we should throw out the existing scientific community or declare that most published research is false (although ironically there is a peer reviewed publication with this title!). Instead, my argument is that we should understand that this process is often messy and complicated. Imperfect research still has value and in my view is still “evidence” even if it is imperfect.
The research and arguments around AI risk are not anywhere near as rigorous as a lot of scientific research (and I linked a comment above where I myself criticize AI risk advocates for overestimating the rigor of their arguments). At the same time, this doesn’t mean that these arguments do not contain any evidence or value. There is a huge amount of uncertainty about what will happen with AI. People worried about the risks from AI are trying to muddle through these issues, just like the scientific community has to muddle through figuring things out as well. I think it is completely valid to point out flaws in arguments, lack of rigor, or overconfidence (as I have also done). But evidence or argument doesn’t have to appear in a journal or conference to count as “evidence”.
My view is that we have to live with the uncertainty and make decisions based on the information we have, while also trying to get better information. Doing nothing and going with the status quo is itself a decision that can have important consequences. We should use the best evidence we have to make the best decision given uncertainty, not just default to the status quo when we lack ideal, rigorous evidence.