I'm on leave from the economics PhD program at UChicago to work at the Forecasting Research Institute.
Connacher Murphy 🔸
Thanks for the thoughts! The question is indeed framed as being about capabilities and not adoption, and this is absolutely central.
Second, people have a wide range of views on any given topic, and surveys reflect this distribution. I think this is a feature, not a bug. Additionally, if you take any noisy measurement (which all surveys are), reading too much into the tails can lead one astray (I don't think that's happening in this specific instance, but I want to guard against the view that the existence of noise implies the nonexistence of signal). Nevertheless, I do appreciate the careful read.
Your comments here are part of why I think the third disclaimer we include, which allows for jagged capabilities, is important. Additionally, we don't require that all capabilities are achieved, hence the "best matching" qualifier, rather than looking at the minimum across the capabilities space.
We indeed developed/tested versions of this question which included a section on current capabilities. Survey burden is another source of noise/bias in surveys, so such modifications are not costless. I absolutely agree that current views of progress will impact responses to this question.
I'll reiterate that LEAP is a portfolio of questions, and I think we have other questions where disagreement about current capabilities is less of an issue because the target is much less dependent on subjective assessment, though those questions sacrifice some completeness as pictures of AI capabilities. Lastly, any expectation of the future necessarily includes some model of the present.
Always happy to hear suggestions for a new question or revised version of this question!
Thanks for following up!
I am using "extreme" in a very narrow sense, meaning anything above or below the scale provided for this specific question, rather than in any normative sense or as a statement about probabilities. I think people interpret this word differently. I additionally think we have some questions that represent a broader swath of possible outcomes (e.g., TRS), taking a different position on the parsimony and completeness frontier. I suspect we have different goals in mind for this question.
I think others would argue that the slow progress scenario is barely an improvement over current capabilities. Given the disagreement people have over current capabilities, this disagreement over how much progress a certain scenario represents will always exist. Notably, some respondents took the opposite stance to yours: that the slow progress scenario has already been achieved.
I would maintain that we can express these results as the probability that reality best matches a certain scenario, hence the needed addition of the "best matches" qualifier. So, I'm not following your points here, apologies.
And for what it's worth, I think the view that tasks = occupations is reasonably disputed. Again, I still grant the point that framing matters, and it absolutely could be at play here. In fact, I'd argue that it's always at play everywhere, and we can and should do our best to limit its influence.
Appreciate the careful read. I responded to your post.
Thanks for digging in! We've gotten similar feedback on "snack-sized" insights and have it on our list.
Could you say more on "generally I'd encourage you to lean more into identifying underlying cruxes as opposed to quantitative estimates"? I'm not sure I understand what this means in practice, because I think of the two as intimately related. This is likely a product of my view that cruxy questions have a high value of information (some FRI work on this here).
In case it's of interest, our risk-focused work tends to be in self-contained projects (example), so we can pull in respondents with intimate knowledge of the risk model. Nevertheless, we'll include some risk questions in future waves.
The two questions you mention were free text. We asked respondents to list, for example, cognitive limitations of AI. We then compiled the most common responses into a list and used it to create a resolvable forecasting question for the subsequent wave.
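As a minimal sketch of that aggregation step (the example responses below are invented for illustration, not actual survey data), one could tally the free-text answers and keep the most frequent as candidates for the next wave's questions:

```python
from collections import Counter

# Hypothetical free-text responses naming cognitive limitations of AI
# (responses invented for illustration).
responses = [
    "long-horizon planning", "long-horizon planning", "causal reasoning",
    "long-horizon planning", "causal reasoning", "physical-world grounding",
]

# Keep the most common answers as candidates for resolvable questions next wave.
top_candidates = Counter(responses).most_common(2)
print(top_candidates)  # [('long-horizon planning', 3), ('causal reasoning', 2)]
```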
(An author of the report here.) Thanks for engaging with this question and providing your feedback! I'll provide a few of my thoughts, but I will first note that EA Forum posts by individuals affiliated with FRI do not constitute official positions.
I do think the following qualification we provided to forecasters (also noted by Benjamin) is important: "Reasonable people may disagree with our characterization of what constitutes slow, moderate, or rapid AI progress. Or they may expect to see slow progress observed with some AI capabilities and moderate or fast progress in others. Nevertheless, we ask you to select which scenario, in sum, you feel best represents your views."
I would also agree with Benjamin that "best matching" covers scenarios with slower and faster progress than the slow and fast progress scenarios, respectively. And I believe our panel is sophisticated and capable of understanding this feature. Additionally, this question was not designed to capture the extreme possibilities for AI progress, and I personally wouldn't use it to inform my views on those possibilities (I think the mid-probability space is interesting and underexplored, and we want LEAP to fill this gap). That said, you are correct that we ought to include the "best matching" qualification when we present these results, and I've added this to our paper revision to-do list. Thanks for pointing that out.
I think other questions in the survey do a better job of covering the full range of possibilities, both in the scenario questions (e.g., TRS) and in our more traditional, easily resolvable forecasting questions. The latter group comprises the vast majority of our surveys. I think it's impossible to write a single forecasting question that satisfies any reasonable and comprehensive set of desiderata, so I'd view LEAP as a portfolio of questions.
On edit #2, I would first note that it is challenging to write a set of scenarios for AI progress without an explosion in the number of scenarios (and an associated increase in survey burden, which would itself degrade response quality); we face a tradeoff between parsimony and completeness. This specific question in the first survey is uniquely focused on parsimony, and we attempted to include questions that take other stances on that tradeoff. However, we'd love to hear any suggestions you have for writing these types of questions, as we could certainly improve on this front. I think you've already identified many of the shortcomings in this specific question. Second, I would defend our choice to present the results as probabilities (though we should add the "best matching" qualifier). We're making an appeal to intersubjective resolution. Witkowski et al. (2017) is one example, and some people at FRI have done similar work (Karger et al. 2021). Both approaches rely on wisdom-of-the-crowd effects. Again, however, I don't think it's clear from the paper that we're making this appeal, so I've added a note to clarify this. We use a resolution criterion (metaprediction) that some find unintuitive, but it allows us to incentivize this question; others might argue that incentives are less important.
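To make the metaprediction idea concrete, here is a minimal sketch of one way an intersubjective resolution could be scored. The function name, the use of the crowd median as the resolution value, and the squared-error (Brier-style) scoring are illustrative assumptions, not a description of LEAP's exact procedure.

```python
import numpy as np

def score_metapredictions(metapredictions: np.ndarray, crowd_forecasts: np.ndarray) -> np.ndarray:
    """Score each forecaster's metaprediction against the realized crowd aggregate.

    metapredictions: each forecaster's prediction of the crowd's median probability (in [0, 1]).
    crowd_forecasts: the crowd's actual probability forecasts (in [0, 1]).

    Returns a squared-error score per forecaster (lower is better). The question
    "resolves" to the crowd aggregate rather than to a ground-truth observation,
    which is what allows otherwise hard-to-resolve questions to be incentivized.
    """
    resolution = np.median(crowd_forecasts)  # crowd aggregate used as the resolution value
    return (metapredictions - resolution) ** 2

# Toy usage: five forecasters predict the crowd median for a scenario question.
metas = np.array([0.30, 0.45, 0.50, 0.55, 0.70])
crowd = np.array([0.40, 0.50, 0.55, 0.60, 0.65])
print(score_metapredictions(metas, crowd))
```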
While I think framing effects obviously matter in surveys, I do think that your edit #3 conflates an elicitation/measurement/instrumentation issue in low-probability forecasting with the broader phenomenon of framing, which I view as being primarily, but not exclusively, about question phrasing. We're including tests of framing and the elicitation environment in LEAP itself to make sure our results aren't too sensitive to any framing effects, and we'll be sharing more on those in the future. I'd love to hear any ideas for experiments we should run there.
In sum, I largely defend the choices we made in writing this question. LEAP includes many different types of questions because consumers of the research differ in which types of questions they find informative and convincing. I will note that even within FRI, some people personally find the scenario questions much less compelling than the other questions in the survey. Nevertheless, I think you identified issues with our framing of the results, and we will make some changes. I appreciate you laying out your criticisms of the paper clearly so that we can dig into them, and I'd welcome any additional feedback!
I share a lot of Drew's skepticism about the study, especially regarding experimenter demand effects. If monitoring alone is enough to increase productivity, I think it's quite plausible that there is some further response (beyond a direct effect of the glasses on vision) to monitoring plus the provision of glasses. Even as a strong proponent of quantile regression in many applications, I do think OLS is more appropriate for a cost-effectiveness analysis. A median shift could be consistent with either a much larger or a much smaller (even negative) impact on aggregate utility.
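To illustrate that last point with a toy simulation (the numbers below are invented, not taken from the study), here is a case where the median treatment effect is positive while the mean effect, the quantity an aggregate cost-effectiveness analysis cares about, is negative:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical individual treatment effects on earnings: most workers gain a
# small amount, but a minority suffers a large loss (invented for illustration).
effects = np.where(rng.random(n) < 0.8,
                   rng.normal(loc=5.0, scale=1.0, size=n),    # 80% gain around +5
                   rng.normal(loc=-40.0, scale=5.0, size=n))  # 20% lose around -40

print(f"median effect: {np.median(effects):.1f}")  # positive: a quantile regression looks good
print(f"mean effect:   {np.mean(effects):.1f}")    # negative: aggregate impact is harmful
```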
However, I do think the point about glasses as an experience good is a good one and could quite possibly be at play here. If getting glasses for work is not a normal activity, it could be easy to underestimate the benefits of doing so.
Thanks for all the helpful thoughts. The distinction between the growth work and treatment spillovers is useful. Growth theory doesnāt seem to be a huge field these days, whereas program evaluation is quite large.
[Question] Economics PhD Candidacy Considerations
Hi everyone, I'm Connor. I'm an economics PhD student at UChicago. I've been tangentially interested in the EA movement for years, but I've started to invest more after reading What We Owe The Future. In about a month, I'm attending a summer course hosted by the Forethought Foundation, so I look forward to learning even more.
I intend to specialize in development and environmental economics, so I'm most interested in the global health and development focus area of EA. However, I look forward to learning more about other causes.
I'm also hoping to learn more about how to orient my research and work towards EA topics and engage with the community during my studies.
Do you view this as separate from the rationale data we also collect? One low-burden way to do this is to just include something like your text in the rationale prompt.