[ETA: See also this summary of our findings + potential lessons by Ben for the 80k blog.]
Some people seem to achieve orders of magnitudes more than others in the same job. For instance, among companies funded by Y Combinator the top 0.5% account for more than ⅔ of the total market value; and among successful bestseller authors, the top 1% stay on the New York Times bestseller list more than 25 times longer than the median author in that group.
This is a striking and often unappreciated fact, but raises many questions. How many jobs have these huge differences in achievements? More importantly, why can achievements differ so much, and can we identify future top performers in advance? Are some people much more talented? Have they spent more time practicing key skills? Did they have more supportive environments, or start with more resources? Or did the top performers just get lucky?
More precisely, when recruiting, for instance, we’d want to know the following: when predicting the future performance of different people in a given job, what does the distribution of predicted (‘ex-ante’) performance look like?
This is an important question for EA community building and hiring. For instance, if it’s possible to identify people who will be able to have a particularly large positive impact on the world ahead of time, we’d likely want to take a more targeted approach to outreach.
More concretely, we may be interested in two different ways in which we could encounter large performance differences:
If we look at a random person, by how much should we expect their performance to differ from the average?
What share of total output should we expect to come from the small fraction of people we’re most optimistic about (say, the top 1% or top 0.1%) – that is, how heavy-tailed is the distribution of ex-ante performance?
(See this appendix for how these two notions differ from each other.)
Depending on the decision we’re facing we might be more interested in one or the other. Here we mostly focused on the second question, i.e., on how heavy the tails are.
This post contains our findings from a shallow literature review and theoretical arguments. Max was the lead author, building on some initial work by Ben, who also provided several rounds of comments.
You can see a short summary of our findings below.
We expect this post to be useful for:
(Primarily:) Junior EA researcherswho want to do further research in this area. See in particular the section on Further research.
(Secondarily:) EA decision-makers who want to get a rough sense of what we do and don’t know about predicting performance. See in particular this summary and the bolded parts in our section on Findings.
We weren’t maximally diligent with double-checking our spreadsheets etc.; if you wanted to rely heavily on a specific number we give, you might want to do additional vetting.
To determine the distribution of predicted performance, we proceed in two steps:
We start with how ex-post performance is distributed. That is, how much did the performance of different people vary when we look back at completed tasks? On these questions, we’ll review empirical evidence on both typical jobs and expert performance (e.g. research).
Then we ask how ex-ante performance is distributed. That is, when we employ our best methods to predict future performance by different people, how will these predictions vary? On these questions, we review empirical evidence on measurable factors correlating with performance as well as the implications of theoretical considerations on which kinds of processes will generate different types of distributions.
Here we adopt a very loose conception of performance that includes both short-term (e.g. sales made on one day) and long-term achievements (e.g. citations over a whole career). We also allow for performance metrics to be influenced by things beyond the performer’s control.
Our overall bottom lines are:
Ex-post performance appears ‘heavy-tailed’ in many relevant domains, but with very large differences in how heavy-tailed: the top 1% account for between 4% to over 80% of the total. For instance, we find ‘heavy-tailed’ distributions (e.g. log-normal, power law) of scientific citations, startup valuations, income, and media sales. By contrast, a large meta-analysis reports ‘thin-tailed’ (Gaussian) distributions for ex-post performance in less complex jobs such as cook or mail carrier[1]: the top 1% account for 3-3.7% of the total. These figures illustrate that the difference between ‘thin-tailed’ and ‘heavy-tailed’ distributions can be modest in the range that matters in practice, while differences between ‘heavy-tailed’ distributions can be massive. (More.)
Ex-ante performance is heavy-tailed in at least one relevant domain: science. More precisely, future citations as well as awards (e.g. Nobel Prize) are predicted by past citations in a range of disciplines, and in mathematics by scores at the International Maths Olympiad.(More.)
More broadly, there are known, measurable correlates of performance in many domains (e.g. general mental ability). Several of them appear to remain valid in the tails. (More.)
However, these correlations by itself don’t tell us much about the shape of the ex-ante performance distribution: in particular, they would be consistent with either thin-tailed or heavy-tailed ex-ante performance. (More.)
Uncertainty should move us toward acting as if ex-ante performance was heavy-tailed – because if you have some credence in it being heavy-tailed, it’s heavy-tailed in expectation – but not all the way, and less so the smaller our credence in heavy-tails. (More.)
To infer the shape of the ex-ante performance distribution, it would be more useful to have a mechanistic understanding of the process generating performance, but such fine-grainedcausal theories of performance are rarely available. (More.)
Nevertheless, our best guess is that moderately to extremely heavy-tailed ex-ante performance is widespread at least for ‘complex’ and ‘scaleable’ tasks. (I.e. ones where the performance metric can in practice range over many orders of magnitude and isn’t artificially truncated.) This is based on our best guess at the causal processes that generate performance combined with the empirical data we’ve seen. However, we think this is debatable rather than conclusively established by the literature we reviewed. (More.)
There are several opportunities for valuable further research. (More.)
Overall, doing this investigation probably made us a little less confident that highly heavy-tailed distributions of ex-ante performance are widespread, and we think that common arguments for it are often too quick. That said, we still think there are often large differences in performance (e.g. some software engineers have 10-times the output of others[2]), these are somewhat predictable, and it’s often reasonable to act on the assumption that the ex-ante distribution is heavy-tailed in many relevant domains (broadly, when dealing with something like ‘expert’ performance as opposed to ‘typical’ jobs).
Some advice for how to work with these concepts in practice:
In practice, don’t treat ‘heavy-tailed’ as a binary property. Instead, ask how heavy the tails of some quantity of interest are, for instance by identifying the frequency of outliers you’re interested in (e.g. top 1%, top 0.1%, …) and comparing them to the median or looking at their share of the total.[3]
Carefully choose the underlying population and the metric for performance, in a way that’s tailored to the purpose of your analysis. In particular, be mindful of whether you’re looking at the full distribution or some tail (e.g. wealth of all citizens vs. wealth of billionaires).
In an appendix, we provide more detail on some background considerations:
The conceptual difference between ‘high variance’ and ‘heavy tails’: Neither property implies the other. Both mean that unusually good opportunities are much better than typical ones. However, only heavy tails imply that outliers account for a large share of the total, and that naive extrapolation underestimates the size of future outliers. (More.)
We can often distinguish heavy-tailed from light-tailed data by eyeballing (e.g. in a log-log plot), but it’s hard to empirically distinguish different heavy-tailed distributions from one another (e.g. log-normal vs. power laws). When extrapolating beyond the range of observed data, we advise to proceed with caution and to not take the specific distributions reported in papers at face value. (More.)
There is a small number of papers in industrial-organizational psychology on the specific question whether performance in typical jobs is normally distributed or heavy-tailed. However, we don’t give much weight to these papers because their broad high-level conclusion (“it depends”) is obvious but we have doubts about the statistical methods behind their more specific claims. (More.)
We also quote (in more detail than in the main text) the results from a meta-analysis of predictors of salary, promotions, and career satisfaction. (More.)
We provide a technical discussion of how our metrics for heavy-tailedness are affected by the ‘cutoff’ value at which the tail starts. (More.)
Finally, we provide a glossary of the key terms we use, such as performance or heavy-tailed.
We’d like to thank Owen Cotton-Barratt and Denise Melchin for helpful comments on earlier drafts of our write-up, as well as Aaron Gertler for advice on how to best post this piece on the Forum.
Most of Max’s work on this project was done while he was part of the Research Scholars Programme (RSP) at FHI, and he’s grateful to the RSP management and FHI operations teams for keeping FHI/RSP running, and to Hamish Hobbs and Nora Ammann for support with productivity and accountability.
We’re also grateful to Balsa Delibasic for compiling and formatting the reference list.
[2] Similarly, don’t treat ‘heavy-tailed’ as an asymptotic property – i.e. one that by definition need only hold for values above some arbitrarily large value. Instead, consider the range of values that matter in practice. For instance, a distribution that exhibits heavy tails only for values greater than 10^100 would be heavy-tailed in the asymptotic sense. But for e.g. income in USD values like 10^100 would never show up in practice – if your distribution is supposed to correspond to income in USD you’d only be interested in a much smaller range, say up to 10^10. Note that this advice is in contrast to the standard definition of ‘heavy-tailed’ in mathematical contexts, where it usually is defined as an asymptotic property. Relatedly, a distribution that only takes values in some finite range – e.g. between 0 and 10 billion – is never heavy-tailed in the mathematical-asymptotic sense, but it may well be in the “practical” sense (where you anyway cannot empirically distinguish between a distribution that can take arbitrarily large values and one that is “cut off” beyond some very large maximum).
For performance in “high-complexity” jobs such as attorney or physician, that meta-analysis (Hunter et al. 1990) reports a coefficient of variation that’s about 1.5x as large as for ‘medium-complexity’ jobs. Unfortunately, we can’t calculate how heavy-tailed the performance distribution for high-complexity jobs is: for this we would need to stipulate a particular type of distribution (e.g. normal, log-normal), but Hunter et al. only report that the distribution does not appear to be normal (unlike for the low- and medium-complexity cases).
Claims about a 10x output gap between the best and average programmers are very common, as evident from a Google search for ‘10x developer’. In terms of value rather than quantity of output, the WSJ has reported a Google executive claiming a 300x difference. For a discussion of such claims see, for instance, this blog post by Georgia Institute of Technology professor Mark Guzdial. Similarly, slide 37 of this version of Netflix’s influential ‘culture deck’ claims (without source) that “In creative/inventive work, the best are 10x better than the average”.
Similarly, don’t treat ‘heavy-tailed’ as an asymptotic property – i.e. one that by definition need only hold for values above some arbitrarily large value. Instead, consider the range of values that matter in practice. For instance, a distribution that exhibits heavy tails only for values greater than 10^100 would be heavy-tailed in the asymptotic sense. But for e.g. income in USD values like 10^100 would never show up in practice – if your distribution is supposed to correspond to income in USD you’d only be interested in a much smaller range, say up to 10^10. Note that this advice is in contrast to the standard definition of ‘heavy-tailed’ in mathematical contexts, where it usually is defined as an asymptotic property. Relatedly, a distribution that only takes values in some finite range – e.g. between 0 and 10 billion – is never heavy-tailed in the mathematical-asymptotic sense, but it may well be in the “practical” sense (where you anyway cannot empirically distinguish between a distribution that can take arbitrarily large values and one that is “cut off” beyond some very large maximum).
How much does performance differ between people?
by Max Daniel & Benjamin Todd
[ETA: See also this summary of our findings + potential lessons by Ben for the 80k blog.]
Some people seem to achieve orders of magnitudes more than others in the same job. For instance, among companies funded by Y Combinator the top 0.5% account for more than ⅔ of the total market value; and among successful bestseller authors, the top 1% stay on the New York Times bestseller list more than 25 times longer than the median author in that group.
This is a striking and often unappreciated fact, but raises many questions. How many jobs have these huge differences in achievements? More importantly, why can achievements differ so much, and can we identify future top performers in advance? Are some people much more talented? Have they spent more time practicing key skills? Did they have more supportive environments, or start with more resources? Or did the top performers just get lucky?
More precisely, when recruiting, for instance, we’d want to know the following: when predicting the future performance of different people in a given job, what does the distribution of predicted (‘ex-ante’) performance look like?
This is an important question for EA community building and hiring. For instance, if it’s possible to identify people who will be able to have a particularly large positive impact on the world ahead of time, we’d likely want to take a more targeted approach to outreach.
More concretely, we may be interested in two different ways in which we could encounter large performance differences:
If we look at a random person, by how much should we expect their performance to differ from the average?
What share of total output should we expect to come from the small fraction of people we’re most optimistic about (say, the top 1% or top 0.1%) – that is, how heavy-tailed is the distribution of ex-ante performance?
(See this appendix for how these two notions differ from each other.)
Depending on the decision we’re facing we might be more interested in one or the other. Here we mostly focused on the second question, i.e., on how heavy the tails are.
This post contains our findings from a shallow literature review and theoretical arguments. Max was the lead author, building on some initial work by Ben, who also provided several rounds of comments.
You can see a short summary of our findings below.
We expect this post to be useful for:
(Primarily:) Junior EA researchers who want to do further research in this area. See in particular the section on Further research.
(Secondarily:) EA decision-makers who want to get a rough sense of what we do and don’t know about predicting performance. See in particular this summary and the bolded parts in our section on Findings.
We weren’t maximally diligent with double-checking our spreadsheets etc.; if you wanted to rely heavily on a specific number we give, you might want to do additional vetting.
To determine the distribution of predicted performance, we proceed in two steps:
We start with how ex-post performance is distributed. That is, how much did the performance of different people vary when we look back at completed tasks? On these questions, we’ll review empirical evidence on both typical jobs and expert performance (e.g. research).
Then we ask how ex-ante performance is distributed. That is, when we employ our best methods to predict future performance by different people, how will these predictions vary? On these questions, we review empirical evidence on measurable factors correlating with performance as well as the implications of theoretical considerations on which kinds of processes will generate different types of distributions.
Here we adopt a very loose conception of performance that includes both short-term (e.g. sales made on one day) and long-term achievements (e.g. citations over a whole career). We also allow for performance metrics to be influenced by things beyond the performer’s control.
Our overall bottom lines are:
Ex-post performance appears ‘heavy-tailed’ in many relevant domains, but with very large differences in how heavy-tailed: the top 1% account for between 4% to over 80% of the total. For instance, we find ‘heavy-tailed’ distributions (e.g. log-normal, power law) of scientific citations, startup valuations, income, and media sales. By contrast, a large meta-analysis reports ‘thin-tailed’ (Gaussian) distributions for ex-post performance in less complex jobs such as cook or mail carrier[1]: the top 1% account for 3-3.7% of the total. These figures illustrate that the difference between ‘thin-tailed’ and ‘heavy-tailed’ distributions can be modest in the range that matters in practice, while differences between ‘heavy-tailed’ distributions can be massive. (More.)
Ex-ante performance is heavy-tailed in at least one relevant domain: science. More precisely, future citations as well as awards (e.g. Nobel Prize) are predicted by past citations in a range of disciplines, and in mathematics by scores at the International Maths Olympiad. (More.)
More broadly, there are known, measurable correlates of performance in many domains (e.g. general mental ability). Several of them appear to remain valid in the tails. (More.)
However, these correlations by itself don’t tell us much about the shape of the ex-ante performance distribution: in particular, they would be consistent with either thin-tailed or heavy-tailed ex-ante performance. (More.)
Uncertainty should move us toward acting as if ex-ante performance was heavy-tailed – because if you have some credence in it being heavy-tailed, it’s heavy-tailed in expectation – but not all the way, and less so the smaller our credence in heavy-tails. (More.)
To infer the shape of the ex-ante performance distribution, it would be more useful to have a mechanistic understanding of the process generating performance, but such fine-grained causal theories of performance are rarely available. (More.)
Nevertheless, our best guess is that moderately to extremely heavy-tailed ex-ante performance is widespread at least for ‘complex’ and ‘scaleable’ tasks. (I.e. ones where the performance metric can in practice range over many orders of magnitude and isn’t artificially truncated.) This is based on our best guess at the causal processes that generate performance combined with the empirical data we’ve seen. However, we think this is debatable rather than conclusively established by the literature we reviewed. (More.)
There are several opportunities for valuable further research. (More.)
Overall, doing this investigation probably made us a little less confident that highly heavy-tailed distributions of ex-ante performance are widespread, and we think that common arguments for it are often too quick. That said, we still think there are often large differences in performance (e.g. some software engineers have 10-times the output of others[2]), these are somewhat predictable, and it’s often reasonable to act on the assumption that the ex-ante distribution is heavy-tailed in many relevant domains (broadly, when dealing with something like ‘expert’ performance as opposed to ‘typical’ jobs).
Some advice for how to work with these concepts in practice:
In practice, don’t treat ‘heavy-tailed’ as a binary property. Instead, ask how heavy the tails of some quantity of interest are, for instance by identifying the frequency of outliers you’re interested in (e.g. top 1%, top 0.1%, …) and comparing them to the median or looking at their share of the total.[3]
Carefully choose the underlying population and the metric for performance, in a way that’s tailored to the purpose of your analysis. In particular, be mindful of whether you’re looking at the full distribution or some tail (e.g. wealth of all citizens vs. wealth of billionaires).
In an appendix, we provide more detail on some background considerations:
The conceptual difference between ‘high variance’ and ‘heavy tails’: Neither property implies the other. Both mean that unusually good opportunities are much better than typical ones. However, only heavy tails imply that outliers account for a large share of the total, and that naive extrapolation underestimates the size of future outliers. (More.)
We can often distinguish heavy-tailed from light-tailed data by eyeballing (e.g. in a log-log plot), but it’s hard to empirically distinguish different heavy-tailed distributions from one another (e.g. log-normal vs. power laws). When extrapolating beyond the range of observed data, we advise to proceed with caution and to not take the specific distributions reported in papers at face value. (More.)
There is a small number of papers in industrial-organizational psychology on the specific question whether performance in typical jobs is normally distributed or heavy-tailed. However, we don’t give much weight to these papers because their broad high-level conclusion (“it depends”) is obvious but we have doubts about the statistical methods behind their more specific claims. (More.)
We also quote (in more detail than in the main text) the results from a meta-analysis of predictors of salary, promotions, and career satisfaction. (More.)
We provide a technical discussion of how our metrics for heavy-tailedness are affected by the ‘cutoff’ value at which the tail starts. (More.)
Finally, we provide a glossary of the key terms we use, such as performance or heavy-tailed.
For more details, see our full write-up.
Acknowledgments
We’d like to thank Owen Cotton-Barratt and Denise Melchin for helpful comments on earlier drafts of our write-up, as well as Aaron Gertler for advice on how to best post this piece on the Forum.
Most of Max’s work on this project was done while he was part of the Research Scholars Programme (RSP) at FHI, and he’s grateful to the RSP management and FHI operations teams for keeping FHI/RSP running, and to Hamish Hobbs and Nora Ammann for support with productivity and accountability.
We’re also grateful to Balsa Delibasic for compiling and formatting the reference list.
[2] Similarly, don’t treat ‘heavy-tailed’ as an asymptotic property – i.e. one that by definition need only hold for values above some arbitrarily large value. Instead, consider the range of values that matter in practice. For instance, a distribution that exhibits heavy tails only for values greater than 10^100 would be heavy-tailed in the asymptotic sense. But for e.g. income in USD values like 10^100 would never show up in practice – if your distribution is supposed to correspond to income in USD you’d only be interested in a much smaller range, say up to 10^10. Note that this advice is in contrast to the standard definition of ‘heavy-tailed’ in mathematical contexts, where it usually is defined as an asymptotic property. Relatedly, a distribution that only takes values in some finite range – e.g. between 0 and 10 billion – is never heavy-tailed in the mathematical-asymptotic sense, but it may well be in the “practical” sense (where you anyway cannot empirically distinguish between a distribution that can take arbitrarily large values and one that is “cut off” beyond some very large maximum).
For performance in “high-complexity” jobs such as attorney or physician, that meta-analysis (Hunter et al. 1990) reports a coefficient of variation that’s about 1.5x as large as for ‘medium-complexity’ jobs. Unfortunately, we can’t calculate how heavy-tailed the performance distribution for high-complexity jobs is: for this we would need to stipulate a particular type of distribution (e.g. normal, log-normal), but Hunter et al. only report that the distribution does not appear to be normal (unlike for the low- and medium-complexity cases).
Claims about a 10x output gap between the best and average programmers are very common, as evident from a Google search for ‘10x developer’. In terms of value rather than quantity of output, the WSJ has reported a Google executive claiming a 300x difference. For a discussion of such claims see, for instance, this blog post by Georgia Institute of Technology professor Mark Guzdial. Similarly, slide 37 of this version of Netflix’s influential ‘culture deck’ claims (without source) that “In creative/inventive work, the best are 10x better than the average”.
Similarly, don’t treat ‘heavy-tailed’ as an asymptotic property – i.e. one that by definition need only hold for values above some arbitrarily large value. Instead, consider the range of values that matter in practice. For instance, a distribution that exhibits heavy tails only for values greater than 10^100 would be heavy-tailed in the asymptotic sense. But for e.g. income in USD values like 10^100 would never show up in practice – if your distribution is supposed to correspond to income in USD you’d only be interested in a much smaller range, say up to 10^10. Note that this advice is in contrast to the standard definition of ‘heavy-tailed’ in mathematical contexts, where it usually is defined as an asymptotic property. Relatedly, a distribution that only takes values in some finite range – e.g. between 0 and 10 billion – is never heavy-tailed in the mathematical-asymptotic sense, but it may well be in the “practical” sense (where you anyway cannot empirically distinguish between a distribution that can take arbitrarily large values and one that is “cut off” beyond some very large maximum).