Great post! Seems like the predictability question is important given how much power laws surface in discussions of EA stuff.
More precisely, future citations as well as awards (e.g. Nobel Prize) are predicted by past citations in a range of disciplines
I want to argue that things which look like predicting future citations from past citations are at least partially “uninteresting” in their predictability, in a certain important sense.
(I think this is related to other comments, and I have not read your Google doc, so apologies if I’m restating. But I think it’s worth drawing out this distinction.)
In most of the cases where I can think of wanting good ex-ante predictions of heavy-tailed outcomes, I want to make those predictions about a collection which is at an “early stage”. For example, I might want to predict which EAs will be successful academics, or which of 10 startups’ seed rounds I should invest in.
Having better predictive performance at earlier stages gives you a massive multiplier in heavy-tailed domains: investing in a Series C is dramatically more expensive than a seed investment.
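To put toy numbers on that multiplier (entirely made up for illustration, not from the post): buying the same stake in the same company gets much more expensive as the stage advances.

```python
# Made-up illustrative numbers: buying the same 1% stake at seed vs. Series C.
seed_valuation = 10_000_000       # assumed post-money valuation at seed
series_c_valuation = 200_000_000  # assumed post-money valuation at Series C
stake = 0.01                      # fraction of the company purchased

print(f"cost at seed:     ${stake * seed_valuation:,.0f}")      # $100,000
print(f"cost at Series C: ${stake * series_c_valuation:,.0f}")  # $2,000,000
print(f"multiplier:       {series_c_valuation / seed_valuation:.0f}x")  # 20x
```

So identifying the winner at seed stage buys the same upside at a twentieth of the price (ignoring dilution and risk, which of course is the whole game).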
Given this, I would really love to have a function which takes in the intrinsic characteristics of an object, and outputs a good prediction of performance.
Citations are not intrinsic characteristics.
When someone is choosing whom to cite, they look at, among other things, how many citations a paper already has. All else equal, a paper/author with more citations will get cited more than a paper with fewer citations. Given the limited attention span of academics (myself as a case in point), the more highly cited paper will tend to get cited even if the alternative paper is objectively better.
(Ed Boyden at MIT has this idea of “hidden gems” in the literature, which are extremely undercited papers with great ideas: I believe the original idea for PCR, a molecular bio technique, had been languishing for at least 5 years with very little attention before its later rediscovery. This is evidence of the failure of citations to track quality.)
Domains in which “the rich get richer” like this are known (given an extra condition or two) to follow heavy-tailed distributions; this is the classic story of preferential attachment.
In domains dominated by this effect, we can predict ex-ante that the earliest settlers in a given “niche” are most likely to end up dominating the upper tail of the power law. But if the niche is empty, and we are asked to predict, based on intrinsic characteristics, which of a set of candidates would be able to set up shop in it, we should be more skeptical of our predictive ability, it seems to me.
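In case it helps make the mechanism concrete, here’s a minimal toy simulation of preferential attachment (my own illustration with made-up parameters, not a model from the post or the papers it cites): each new citation goes to a paper with probability proportional to its current citation count plus one.

```python
import random

def simulate_preferential_attachment(n_papers=1000, n_citations=50_000, seed=0):
    rng = random.Random(seed)
    citations = [0] * n_papers
    # The 'urn' holds one ticket per paper plus one extra ticket per
    # citation already received, so drawing a ticket uniformly picks
    # papers with probability proportional to (citations + 1).
    urn = list(range(n_papers))
    for _ in range(n_citations):
        i = rng.choice(urn)
        citations[i] += 1
        urn.append(i)  # the cited paper becomes still more likely to be cited
    return citations

counts = sorted(simulate_preferential_attachment(), reverse=True)
print("share of all citations held by the top 1% of papers:",
      round(sum(counts[:10]) / sum(counts), 2))
```

With no quality differences between papers at all, this reliably produces a heavy right tail, which is the sense in which predicting later citations from early citations is “uninteresting”.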
Besides citations, I’d argue that many/most other prestige-driven enterprises have at least a non-negligible component of their variance explained by preferential attachment. I don’t think it’s a coincidence that the oldest universities in a geography also tend to be the most prestigious, for example. This dynamic is also present in links on the interwebs and lots of other interesting places.
I’m currently most interested in how predictable heavy-tailed outcomes are before you have seen the citation-count analogue, because it seems like a lot of potentially valuable EA work is in niches which don’t exist yet.
That doesn’t mean the other type of predictability is useless, though. It seems like maybe on the margin we should actually be happier defaulting to making a bet on whichever option has accumulated the most “citations” to date instead of trusting our judgement of the intrinsic characteristics.

Anyhoo- thanks again for looking into this!
Thanks! I agree with a lot of this.

I think the case of citations / scientific success is a bit subtle:
My guess is that the preferential attachment story applies most straightforwardly at the level of papers rather than scientists. E.g. I would expect that scientists who want to cite something on topic X will cite the most-cited paper on X rather than first looking for papers on X and then looking up the total citations of their authors.
I think the Sinatra et al. (2016) findings which we discuss in our relevant section push at least slightly against a story that says it’s all just about “who was first in some niche”. In particular, if preferential attachment at the level of scientists were a key driver, then I would expect authors who get lucky early in their career (i.e. publish a much-cited paper early) to get more total citations. That is, citations to future papers by a fixed scientist should depend on citations to past papers by the same scientist. But that is not what Sinatra et al. find: within the career of a fixed scientist, per-paper citations seem entirely random.
Instead their model uses citations to estimate an ‘intrinsic characteristic’ that differs between scientists—what they call Q.
(I don’t think this is very strong evidence that such an intrinsic quality ‘exists’, because this is just how they chose the class of models they fit. Their model fits the data reasonably well, but we don’t know whether a different model with different bells and whistles would fit the data just as well or better. But note that, at least in my view, the idea that there are ability differences between scientists that correlate with citations looks likely on priors anyway, e.g. because of what we know about GMA/the ‘positive manifold’ of cognitive tasks, or garden-variety impressions that some scientists just seem smarter than others.)
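For concreteness, here is the multiplicative structure of that model as I understand it (a rough sketch with assumed lognormal luck; Sinatra et al.’s actual estimation procedure is more involved): citations to a paper are the product of a fixed per-scientist factor Q and a per-paper luck draw that is independent of the scientist’s past papers.

```python
import math
import random

rng = random.Random(1)

def career_citations(Q, n_papers, mu_p=1.0, sigma_p=1.0):
    # Per-paper 'luck' p is lognormal and independent across papers, so an
    # early hit tells you nothing about later papers by the same scientist.
    return [Q * rng.lognormvariate(mu_p, sigma_p) for _ in range(n_papers)]

papers = career_citations(Q=3.0, n_papers=200)

# Since log(c) = log(Q) + log(p), averaging log-citations and subtracting
# the (here assumed known) mean of log(p) recovers an estimate of Q.
mean_log_c = sum(math.log(c) for c in papers) / len(papers)
print("estimated Q:", round(math.exp(mean_log_c - 1.0), 2))  # true Q = 3.0
```

In the real setting the luck distribution has to be estimated from the whole population rather than assumed known, which is part of why I’d hesitate to read too much into the specific functional form.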
The International Maths Olympiad (IMO) paper seems like a clear example of our ability to measure an ‘intrinsic characteristic’ before we’ve seen the analogue of citation counts. IMO participants are high school students, and the paper finds that even among people who participated in the IMO in the same year and got their PhD from the same department, IMO scores correlate with citations, awards, etc. Now, we might think that maybe maths is extreme in that success there depends unusually much on fluid intelligence or something like that, and I’m somewhat sympathetic to that point / think it’s partly correct. But on priors I would find it very surprising if this phenomenon were completely idiosyncratic to maths. Like, I’d be willing to bet that scores at the International Physics Olympiad, International Biology Olympiad, etc., as well as simply GMA or high school grades or whatever, correlate with future citations in the respective fields.
The IMO example is particularly remarkable because it’s in the extreme tail of performance. If we’re not particularly interested in the tail, then I think some of the studies on more garden-variety predictors, such as GMA or personality, that we cite in the relevant section give similar examples.
Interesting! Many great threads here. I definitely agree that some component of scientific achievement is predictable from intrinsic characteristics, and the IMO example is excellent evidence for this. I didn’t mean to imply any sort of disagreement with the premise that talent matters; I was instead pointing at a component of the variance in outcomes which follows different rules.
Fwiw, my actual bet is that to become a top-of-field academic you need both talent AND to get very lucky with early career buzz. The latter is an instantiation of preferential attachment. I’d guess for each top-of-field academic there are at least 10 similarly talented people who got unlucky in the paper lottery and didn’t have enough prestige to make it to the next stage in the process.
It sounds like I should probably just read Sinatra, but it’s quite surprising to me that publishing a highly cited paper early in one’s career isn’t correlated with a larger total number of citations, at the high-performing tail (did I understand that right? Were they considering the right tail?). Anecdotally, I notice that the top profs I know tend to have had a big paper/discovery early. E.g. Ed Boyden, whom I have been thinking of because he has interesting takes on metascience, ~invented optogenetics during his PhD in 2005 (at least I think this was the story?) and it remains his most cited paper to this day by a factor of ~3.
On the scientist vs. paper preferential attachment story: I could buy that. I was pondering while writing my comment how much is person-prestige driven vs. paper driven. I think for the most part you’re right that it’s paper driven, but I decided this cashes out as effectively the same thing. My reasoning was: if citations per paper are power-law-ish, then, because citations per scientist is just the sum of these, the total will be dominated by the top few papers. Therefore preferential attachment at the level of papers will produce “rich get richer” at the level of scientists, and this is still an example of the phenomenon, because it’s not an intrinsic characteristic.
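A quick sanity check of that reasoning step, with made-up numbers: draw per-paper citations from a heavy-tailed (Pareto) distribution and look at how much of a scientist’s total comes from their single best paper.

```python
import random

rng = random.Random(0)
ALPHA = 1.5  # assumed Pareto shape; smaller alpha means a heavier tail

shares = []
for _ in range(1000):  # 1000 simulated careers of 50 papers each
    papers = [rng.paretovariate(ALPHA) for _ in range(50)]
    shares.append(max(papers) / sum(papers))

shares.sort()
print("median share of a career's citations from its best paper:",
      round(shares[len(shares) // 2], 2))
```

So a big draw on one paper largely determines the scientist-level total, which is the sense in which paper-level preferential attachment cashes out as scientist-level “rich get richer”.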
That said, my highly anecdotal experience is that there is actually a per-person effect at the very top. I’ve been lucky to work with George Church, one of the top profs in synthetic biology. Folks in the lab literally talk about “the George Effect” when submitting papers to top journals: the paper is more attractive simply because George’s name is on it.
But my sense is that I should look into some of the refs you provided! (thanks :)
it’s quite surprising to me that publishing a highly cited paper early in one’s career isn’t correlated with a larger total number of citations, at the high-performing tail (did I understand that right? Were they considering the right tail?)
No, they considered the full distribution of scientists with long careers and sustained publication activity (which themselves form the tail of the larger population of everyone with a PhD).
That is, their analysis includes the right tail but wasn’t exclusively focused on it. Since by its very nature the right tail contains only a few data points, it won’t carry much weight when fitting their model. So it could in principle be the case that looking only at the right tail would suggest a different model.
It is certainly possible that early successes play a larger causal role in the extreme right tail: we often find distributions that are mostly log-normal but have a power-law tail, suggesting that the extreme tail may follow different dynamics.
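A made-up mixture illustrates how that can look (purely illustrative parameters, not fit to any data): a lognormal “body” for most outcomes, with a small fraction drawn from a much heavier Pareto “tail”.

```python
import random

rng = random.Random(0)

def outcome():
    # 99% garden-variety lognormal outcomes, 1% from a heavy Pareto tail.
    if rng.random() < 0.99:
        return rng.lognormvariate(0.0, 1.0)
    return 10.0 * rng.paretovariate(1.0)

xs = sorted(outcome() for _ in range(100_000))
for q in (0.50, 0.90, 0.99, 0.999):
    print(f"{q:.1%} quantile: {xs[int(q * len(xs)) - 1]:,.1f}")
```

The bulk looks lognormal, so a model fit mostly to the bulk can be roughly right everywhere except the extreme tail, where the rare heavy-tailed component takes over.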
Relatedly, you might be interested in these two footnotes discussing how impressive it is that Sinatra et al. (2016), the main paper we discuss in the doc, can predict the evolution of the Hirsch index (a citation measure) over a full career based on the Hirsch index after the first 20 or 50 papers:
Note that the evolution of the Hirsch index depends on two things: (i) citations to future papers and (ii) the evolution of citations to past papers. It seems easier to predict (ii) than (i), but we care more about (i). This raises the worry that predictions of the Hirsch index are a poor proxy for what we care about (predicting citations to future work), because successful predictions of the Hirsch index may work largely by predicting (ii) but not (i). This does make Sinatra and colleagues’ ability to predict the Hirsch index less impressive and useful, but the worry is attenuated by two observations: first, the internal validity of their model for predicting successful scientific careers is independently supported by its ability to predict Nobel Prizes and other awards; second, they can predict the Hirsch index over a very long period, when it is increasingly dominated by future work rather than accumulating citations to past work.
Acuna, Allesina, & Kording (2012) had previously proposed a simple linear model for predicting scientists’ Hirsch index. However, the validity of their model for the purpose of predicting the quality of future work is undermined more strongly by the worry explained in the previous footnote; in addition, the reported validity of their model is inflated by their heterogeneous sample that, unlike the sample analyzed by Sinatra et al. (2016), contains both early- and late-career scientists. (Both points were observed by Penner et al. 2013.)
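For readers who don’t have the definition cached: the Hirsch index discussed in these footnotes is easy to state in code (the definition is standard; the implementation is just my own sketch).

```python
def h_index(citations):
    """Largest h such that the scientist has h papers with >= h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, c in enumerate(ranked, start=1):
        if c >= rank:
            h = rank
        else:
            break
    return h

# Five papers with these citation counts give h = 3: three papers have
# at least 3 citations each, but there aren't four with at least 4.
print(h_index([10, 5, 3, 2, 1]))  # -> 3
```

This makes the (i)/(ii) decomposition in the first footnote concrete: h can rise either because new papers enter the ranked list or because old papers accumulate enough citations to clear a higher rank.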
Neat. I’d be curious whether anyone has tried blinding the predictive algorithm to prestige: i.e. no past citation information or journal impact factors, instead strictly using paper content (sounds like a project for GPT-6).
It might also be interesting to think about how talent- vs. prestige-based models explain the cases of scientists whose work was groundbreaking but did not garner attention at the time. I’m thinking, e.g., of someone like Kjell Kleppe, who basically described PCR, the foundational molbio method, a decade early.
If you look at natural experiments in which two groups publish the ~same thing, but only one makes the news, a fully talent-based model (I think?) predicts that there should be no significant difference in citations and other markers of academic success between the two groups (unless your model of talent includes something about marketing, which seems like a stretch to me).
Ed Boyden at MIT has this idea of “hidden gems” in the literature, which are extremely undercited papers with great ideas: I believe the original idea for PCR, a molecular bio technique, had been languishing for at least 5 years with very little attention before its later rediscovery.
A related phenomenon has been studied in the scientometrics literature under the label ‘sleeping beauties’.
Here is what Clauset et al. (2017, pp. 478f.) say in their review of the scientometrics/‘science of science’ field:
However, some discoveries do not follow these rules, and the exceptions demonstrate that there can be more to scientific impact than visibility, luck, and positive feedback. For instance, some papers far exceed the predictions made by simple preferential attachment (5, 6). And then there are the “sleeping beauties” in science: discoveries that lay dormant and largely unnoticed for long periods of time before suddenly attracting great attention (7–9). A systematic analysis of nearly 25 million publications in the natural and social sciences over the past 100 years found that sleeping beauties occur in all fields of study (9).

Examples include a now famous 1935 paper by Einstein, Podolsky, and Rosen on quantum mechanics; a 1936 paper by Wenzel on waterproofing materials; and a 1958 paper by Rosenblatt on artificial neural networks. The awakening of slumbering papers may be fundamentally unpredictable in part because science itself must advance before the implications of the discovery can unfold.

[See doc linked in the OP for full reference.]
The awakening of slumbering papers may be fundamentally unpredictable in part because science itself must advance before the implications of the discovery can unfold.
Except to the authors themselves, who may often have an inkling that their paper is important. E.g., I think Rosenblatt was incredibly excited/convinced about the insights in that sleeping beauty paper. (Small chance my memory is wrong about this, or that he changed his mind at some point.)
I don’t think this is just a nitpicky comment on the passage you quoted. I find it plausible that there’s some hard-to-study quantity around ‘research taste’ that predicts impact quite well. It’d be hard to study because the hypothesis is that only very few people have it. To tell who has it, you kind of need to have it a bit yourself. But one decent way to measure it is asking people who are universally regarded as ‘having it’ to comment on who else they think also has it. (I know this process would lead to unfair network effects and may result in false negatives and so on, but I’m advancing a descriptive observation here; I’m not advocating for a specific system on how to evaluate individuals.)
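To make the measurement idea concrete, here’s a toy formalization of that nomination process (entirely my own sketch of the descriptive observation above, with hypothetical names; not a proposed evaluation system): start from a seed set regarded as “having it” and tally nominations as trust spreads outward.

```python
from collections import Counter

# Hypothetical nomination data: who names whom as having research taste.
nominations = {
    "seed_A": ["alice", "bob"],
    "seed_B": ["alice", "carol"],
    "alice":  ["carol"],
}

def taste_scores(nominations, seeds, rounds=2):
    scores = Counter()
    frontier = set(seeds)  # people whose nominations we trust this round
    seen = set(seeds)
    for _ in range(rounds):
        next_frontier = set()
        for person in frontier:
            for nominee in nominations.get(person, []):
                scores[nominee] += 1
                if nominee not in seen:
                    seen.add(nominee)
                    next_frontier.add(nominee)
        frontier = next_frontier
    return scores

print(taste_scores(nominations, {"seed_A", "seed_B"}))
# Counter({'alice': 2, 'carol': 2, 'bob': 1})
```

The network effects I mentioned are visible even in this toy version: you can only be scored if someone reachable from the seeds happens to know your work.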
Related: I remember a comment (can’t find it anymore) somewhere by Liv Boeree or some other poker player familiar with EA. The commenter explained that monetary results aren’t the greatest metric for assessing the skill of top poker players. Instead, it’s best to go with assessments by expert peers. (I think this holds mostly for large-field tournaments, not online cash games.)
If I remember correctly, Linchuan Zhang made or referred to that comment somewhere on the Forum when saying that it was similar with assessing forecaster skill. (Or maybe it was you? :P)
I have indeed made that comment somewhere. It was one of the more insightful/memorable comments she made when I interviewed her, but tragically I didn’t end up writing down that question in the final document (maybe due to my own lack of researcher taste? :P)
That said, human memory is fallible etc., so maybe it’d be worthwhile to circle back to Liv and ask if she still endorses this, and/or ask other poker players how much they agree with it.
I’ve been much less successful than LivB but would endorse it, though I’d note that there are substantially better objective metrics than cash prizes for many kinds of online play, and I’d have a harder time arguing that those were less reliable than the subjective judgements of other good players. It somewhat depends on the sample, though: at the highest stakes, the combination of a very small player pool and fairly small samples makes this quite believable.
To give an example of what would go into research taste, consider the issue of reference class tennis (rationalist jargon for arguments on whether a given analogy has merit, or two people throwing widely different analogies at each other in an argument). That issue comes up a lot especially in preparadigmatic branches of science. Some people may have good intuitions about this sort of thing, while others may be hopelessly bad at it. Since arguments of that form feel notoriously intractable to outsiders, it would make sense if “being good at reference class tennis” were a skill that’s hard to evaluate.
Yeah this is great; I think Ed probably called them sleeping beauties and I was just misremembering :)
Thanks for the references!