Do your opinion updates extend from individual forecasts to aggregated ones? In particular, how reliable do you think the Metaculus median AGI timeline is?
On the one hand, my opinion of Metaculus predictions worsened as I saw how the ‘recent predictions’ showed people piling in on the median on some questions I watch. On the other hand, my opinion of Metaculus predictions improved as I found out that performance doesn’t seem to fall as a function of ‘resolve minus closing’ time (see https://twitter.com/tenthkrige/status/1296401128469471235). Are there observations that have swayed your opinion in similar ways?
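For what it’s worth, the kind of check in that tweet can be reproduced on one’s own data: bucket resolved binary questions by the gap between close and resolution, and compare mean Brier scores per bucket. A minimal sketch, with entirely made-up question data:

```python
from statistics import mean

# Hypothetical resolved binary questions: (community probability at close,
# outcome 0/1, days between close and resolution).
questions = [
    (0.8, 1, 10), (0.3, 0, 45), (0.6, 1, 400),
    (0.9, 1, 30), (0.2, 1, 700), (0.7, 0, 120),
]

def brier(p, outcome):
    # Brier score for a binary forecast: squared error, lower is better.
    return (p - outcome) ** 2

# Group Brier scores by how far resolution was from close.
buckets = {"<90d": [], "90d-1y": [], ">1y": []}
for p, outcome, days in questions:
    key = "<90d" if days < 90 else ("90d-1y" if days < 365 else ">1y")
    buckets[key].append(brier(p, outcome))

for key, scores in buckets.items():
    if scores:
        print(key, round(mean(scores), 3))
```

If mean Brier scores don’t systematically worsen in the longer-horizon buckets, that’s (weak) evidence in the same direction as the linked analysis, though with real data you’d want to control for question difficulty and base rates.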
With regard to the AGI timeline, it’s important to note that Metaculus’ resolution criteria are quite different from a ‘standard’ interpretation of what would constitute AGI[1] (or human-level AI[2], superintelligence[3], transformative AI, etc.). It’s also unclear what proportion of forecasters have read this fine print (I’d be interested to hear others’ views on this), which further complicates interpretation.
For these purposes we will thus define “an artificial general intelligence” as a single unified software system that can satisfy the following criteria, all easily completable by a typical college-educated human.
Able to reliably pass a Turing test of the type that would win the Loebner Silver Prize.
Able to score 90% or more on a robust version of the Winograd Schema Challenge, e.g. the “Winogrande” challenge or a comparable data set for which human performance is at 90+%.
Be able to score 75th percentile (as compared to the corresponding year’s human students; this was a score of 600 in 2016) on the full mathematics section of a circa-2015-2020 standard SAT exam, using just images of the exam pages and having less than ten SAT exams as part of the training data. (Training on other corpuses of math problems is fair game as long as they are arguably distinct from SAT exams.)
Be able to learn the classic Atari game “Montezuma’s revenge” (based on just visual inputs and standard controls) and explore all 24 rooms based on the equivalent of less than 100 hours of real-time play (see closely-related question.)
By “unified” we mean that the system is integrated enough that it can, for example, explain its reasoning on an SAT problem or Winograd schema question, or verbally report its progress and identify objects during videogame play. (This is not really meant to be an additional capability of “introspection” so much as a provision that the system not simply be cobbled together as a set of sub-systems specialized to tasks like the above, but rather a single system applicable to many problems.)
Agreed, I’ve been trying to help out a bit with Matt Barnett’s new question here. Feedback period is still open, so chime in if you have ideas!
I suspect most Metaculites are accustomed to paying attention to how a question’s operationalization deviates from its intent, FWIW. Personally, I find the Montezuma’s Revenge criterion quite important; without it, the question would be far from AGI.
My intent in bringing up this question was more to ask how Linch thinks about the reliability of long-term predictions with no obvious frequentist-friendly track record to look at.
On the one hand, my opinion of Metaculus predictions worsened as I saw how the ‘recent predictions’ showed people piling in on the median on some questions I watch.
Can you say more about this? I ask because this behavior seems consistent with an attitude of epistemic deference towards the community prediction when individual predictors perceive it to be superior to what they can themselves predict given their time and ability constraints.
Sure, at an individual level deference usually makes for better predictions, but at a community level deference-as-the-norm can dilute the weight of those who are informed and predict differently from the median. An excess of deferential predictions also obfuscates how reliable the median prediction is, and thus makes it harder for others to make an informed update on the median.
As you say, it’s better if people contribute information where their relative value-add is greatest, so I’d say it’s reasonable for people to have a 2:1 ratio of questions on which they deviate from the median to questions on which they follow it. My vague impression is that the actual ratio may be lower, especially for people predicting on <1-year time-horizon events. I think you, Linch, and other heavier Metaculus users may have a more informed impression here, though, so I’d be happy to see disagreement.
I think it would be interesting to have a version of Metaculus on which, for every prediction, you have to select a general category for your update, e.g. “New probability calculation”, “Updated to median”, “Information source released”, etc. Seeing the resulting distributions for each question would likely be quite informative.
Do your opinion updates extend from individual forecasts to aggregated ones?
I think the best individual forecasters are, on average, better than the aggregate Metaculus forecasts at the moment they make the prediction, especially if they spend a while on it. I’m less sure if you account for prediction lag (the Metaculus and community predictions are usually better at incorporating new information), and my assessment there will depend on a bunch of details.
In particular, how reliable do you think the Metaculus median AGI timeline is?
I think, as noted by matthew.vandermerwe, the Metaculus question’s operationalization of “AGI” is very different from what our community typically uses. I don’t have a strong opinion on whether a random AI safety person would do better on that operationalization.
For something closer to what EAs care about, I’m pretty suspicious of the current forecasts given for existential risk/GCR estimates (for example in the Ragnarok series), and generally do not think existential risk researchers should strongly defer to them (though I suspect the forecasts and comments are good enough that it’s generally worth most x-risk researchers studying the relevant questions reading them).
[1] OpenAI Charter
[2] expert survey
[3] Bostrom