Why and how to assess expertise

After my previous hippy dippy post, I figured I’d balance the scale with one that is more content-heavy. The following post is about a skill that will be important for many effective altruists to learn: expertise assessment. The thoughts below are from over a hundred hours of things like designing EA Outreach’s hiring process, interviewing lots of job candidates, reviewing 1000+ EA Global applications, and creating the Pareto Fellowship evaluation pipeline.

Why learn how to assess expertise?

Improving the world will involve drawing upon many knowledge and skill domains. Barring development of the skill-learning chair from the Matrix, your capacity is too limited to master all of these domains.

Thus, you must rely upon experts.

However, there are challenges to expert identification. Firstly, there are many who masquerade as experts. This masquerading may be deliberate—e.g., probably most fortune tellers—or due to poor self-evaluation—e.g., many college sophomores. Secondly, there are many experts who are not fully aware of their expertise. This will often be true for “intuitive” experts, e.g., experts at charisma or experts at implicitly modeling other people. Thirdly, there are entire domains in which the base-rate of true expertise in the domain’s phenomena amongst even highly credentialed people is quite low. From both my personal experience in a number of labs, and from coming across lots of findings like this one, I can tell you cognitive science is one such example of a field. This will also be the case in many other social sciences, financial forecasting, consulting, etc. This means that someone’s job title or the letters “PhD” in someone’s email signature are often (usually?) not sufficient markers of expertise. (This is probably the most common trap for those seeking expert consultation.)

Thus, you must learn how to assess expertise.

Here are some circumstances when having expertise assessment tools is incredibly useful:

  • You are asking someone for important information (e.g., as when Open Philanthropy Project consults policy experts for its US Policy investigations).

  • You are figuring out where to donate (e.g., you want to see whether a given organization actually has the ability to do the world-improvement activities that they are intending to do).

  • You are evaluating a potential new hire (e.g., you want to see whether a candidate for a marketing position is in fact good at marketing).

The paradox of expertise assessment

A sociologist friend of mine told me a story about a Chinese emperor named Qin Shi Huang. As the story goes, Qin Shi Huang sought to live forever. So, he recruited a multitude of advisors to find the hidden secret to immortality. Unfortunately, the men he recruited had no ability to assess which magicians or alchemists might be experts in immortality elixirs. Furthermore, the emperor had no ability to assess which advisors might be experts in assessing experts in immortality elixirs. (Note even further that there are no true experts in the relevant domains here. This may remind you of fields such as technological forecasting.) In the end, as these things tended to go, a lot of advisors got executed.

This has a lot in common with Meno’s paradox. Here is Meno’s paradox rephrased as a paradox of expertise assessment:

  1. If you are not an expert in domain D, it is impossible to independently assess whether someone is an expert in D.

  2. You are not an expert in D.

  3. Therefore, it’s impossible for you to independently assess whether someone is an expert in D.

This argument is probably unsound. In particular, premise 1 is probably false, as the following counterargument shows:

  1. If there exist DGM—domain-general markers that tell you whether someone may be an expert—then it is possible to independently assess whether someone is an expert in D, even if you are not an expert in D.

  2. There exist DGM.

  3. Therefore it is possible to independently assess whether someone is an expert in D, even if you are not an expert in D.

Below, I describe some domain-general markers for assessing expertise across domains.

I’ve combined them into a couple of “back-pocket” methods: ones you can use to evaluate people you meet at conferences or dinners, or people you deliberately seek out. They are definitely not exhaustive, but they should cover most cases. I’ve listed the first back-pocket method below. The second back-pocket method I’d rather not spread widely, since it is much more game-able. If you’d be interested in it, feel free to email me at tyler@centreforeffectivealtruism.org. I also have a much more heavy-duty spreadsheet-based method if you have a particularly high-stakes expertise assessment task (e.g., you’re choosing which global poverty team to join for the next several years, or which animal welfare organization to give an enormous amount of money to).

Necessary conditions for expertise: the P-I-F-T method

The claim: each of the four criteria below is a necessary (but, importantly, not sufficient) condition for expertise. You may find them to be pretty obvious, but ask yourself: do I explicitly assess for these things while engaging experts? If not, you may run the risk of, e.g., hiring the wrong people or acting on faulty information.

To gain fluency in assessing each of the following conditions, I recommend the following process:

  1. Write down a list of examples where it is critical for you to trust either your own or someone else’s knowledge or skills. These are examples where either you are the expert or you are counting on an expert.

  2. Think of ways in which you can apply the criteria below to the expert.

Furthermore, each of the examples below will test your intuitions on expertise through questions at the end of them. Please try to answer the questions before looking at the answer.

P: Processing of relevant information

Is the person performing detailed mental operations upon relevant data, beyond the ones which non-experts perform? Is this the sort of processing which would plausibly yield expertise in the relevant domain?

Examples

{You want to invite a philosophy expert to speak at your conference.}

Brad and Will encounter an argument. Brad, the analytical philosophy novice, assesses whether it feels intuitively true. Will, the analytical philosophy expert, not only assesses whether the argument feels intuitively true, but also assesses its conceptual clarity and hidden premises. Who do you think has the better marker of expertise? Why?

--

Answer: Brad lacks a robust process; Will does not. Invite Will.

--

{You want to learn persuasion.}

Steve and Hilary must persuade Rochelle the French bureaucrat to expedite their visa process. Both Steve and Hilary note that Rochelle is wearing delightful chartreuse earrings. Steve, the persuasion novice, runs the mental process of noting that Rochelle is a person, and that people generally respond well to friendliness. Hilary, the persuasion expert, notes that Rochelle’s chartreuse earrings are the only flamboyant clothing items that anyone in the office is wearing. Based on a large sample size of processed experience now stored in her system one, she guesses that the chartreuse earrings could signify that Rochelle wants to distinguish herself from her fellow bureaucrats, to show that she is not just another bureaucrat. Based on this, Hilary gambles on saying things which show that she appreciates Rochelle’s uniqueness, e.g., “You’re the friendliest-looking embassy employee I’ve ever met!” Who do you think has the better marker of expertise? Why?

--

Steve lacks a robust process; Hilary does not. Learn from Hilary.

--

{You are a foundation program officer deciding who to give a grant to.}

Martha and Leanne are both published neuroscientists who study the lateral geniculate nucleus. You have already read their grant proposals, and they seem to be of comparable quality. Even though you are not an expert in geniculate nuclei, let alone lateral ones, you must choose one person to give a grant to. Thus, you must decide who is the potentially more revolutionary scientist. Both Martha and Leanne seem to have broad knowledge of most published work on the topic. They appear to be equally intelligent. However, Leanne takes the novel approach of gem-mining dynamical models in physics and epidemiology that seem to describe similar phenomena in the lateral geniculate nucleus. She also spends a lot of time devising thought experiments and free-associating around tricky questions. These methods seem a bit unusual, but in your grant-making experience, you’ve found that—all else equal—scientists who use unusual methods tend to produce more innovative work. Who do you think has the better marker of expertise? Why?

--

You bet on Leanne over Martha. In this case, both Martha and Leanne probably have robust processes. However, Martha lacks a process that yields revolutionary expertise, while Leanne more likely has one. You probably made the right bet.

--

Ways of assessing

I. Ask questions which will reveal the mental processes of experts, such as:

a. “Before you fix a computer, what’s your general diagnostic process like?” (For a computer repair specialist)

b. “The constructs of confidence and self-efficacy seem very similar. How do you tell the difference between them?” (For an expert in social psychology)

c. “How does the quality of Richard Dawkins’s work compare to that of other evolutionary biologists? Do you have any critiques?” (For an expert in evolutionary biology)

d. “Let’s say I want to stage a magic trick in an extremely crowded room. How would I do that? What about in a room with very loud music?” (For an expert party magician)

II. Find out whether they’ve been part of a job, program, or mentorship that would have equipped them with special mental processes.

I: Interaction with relevant information sources

Is the person regularly interfacing with relevant data? Is this the sort of data that an expert would plausibly engage? Note that merely encountering relevant sources is not sufficient. The expert needs to have paid attention to these sources, as the first example will illustrate.

Examples

{You want to hire a graphic designer.}

Bob and Maria are walking down a city street. Bob, a graphic design novice, pays no attention to the signs and advertisements along the side of the street, even though they are within his field of vision. Maria, an expert, pays full attention to these things. She notes the lack of spatial alignment amongst elements in the dry cleaner’s sign. As she passes a Louis Vuitton ad at the bus stop, she ogles the beautiful ball serifs of the Bauer Bodoni bold italic typeface (incidentally, my favorite font!). Color schemes, geometry, and visual flows all jump out at her as objects unto themselves. Who do you think has the better marker of expertise? Why?

--

Bob encounters relevant data, but does not engage it; Maria both encounters and engages relevant data. Hire Maria.

--

{You want to learn how to fundraise.}

Cassandra is a quantitative finance expert but a novice at fundraising. Jake is an expert at fundraising. Jake is constantly immersing himself in fundraising case studies, talking to other experts, and meeting with funders. Cassandra, on the other hand, interfaces with sources like mathematical models of markets. Who do you think has the better marker of expertise? Why?

--

Cassandra does not interface with relevant enough information sources; Jake does. Consult Jake.

--

{You want to improve the effectiveness of your team.}

Brian is a Princeton academic who claims to be an expert in team effectiveness. The evidence: He has analyzed 1000 small family businesses and has been published multiple times in Science. Miranda does not claim to be an expert in team effectiveness, but several people have suggested that she might be. The evidence: she is the rare type of venture capitalist who formerly founded a successful startup, ran a large company, and now sits on nonprofit boards and invests in companies of all sizes (and has a winning track record doing so). Who do you think has the better marker of expertise? Why?

--

Unless your organization is a small family business, Brian has probably not interfaced with relevant information sources. Miranda, on the other hand, has engaged a wide variety of organizations. There is a good chance that her ideas about team effectiveness might be higher quality, since she will likely have abstracted organization-general lessons from a more diverse sample. This is a more difficult case than the other two, but if faced with a decision between the two, I would consult Miranda instead of Brian.

--

Ways of assessing

I. Ask questions which will reveal what sorts of information they engage, such as:

a. “Tell me about individual cases in your management experience.” (For a manager you might hire)

b. “What sorts of things do you pay attention to when you’re at an event?” (For an event director)

c. “Roughly how many pieces do you edit in an average month?” (For an editor)

d. “Which papers would you recommend reading to understand the cutting edge in hyperbolic geometry?” (For an expert in hyperbolic geometry)

II. Find out whether they’ve been part of a job, program, or mentorship that would have given them strong samples of relevant information.

III. See how fluently they can generate examples of phenomena in the domain. The more examples they can generate, the better.

F: Feedback with relevant metrics

Does the person have (or have they had) feedback loops that help them accurately calibrate whether they are increasing their expertise or making accurate judgements?

In domains where reality does not give good feedback (e.g., philosophy, sociology, long-term prediction), they need a set of well-honed heuristics or proxy feedback methods to correct their output if their results are going to be reliably good. In domains where reality can give good feedback (e.g., massage, auto repair, swordfighting), they don’t necessarily need well-honed heuristics or proxy feedback methods. All else equal, superior feedback loops have the following attributes (idealized versions below):

  • Speed (you learn about discrepancies between current and desired output quickly after taking an action so you can course-correct)

  • Frequency (the feedback loop happens frequently, giving you more samples to calibrate on)

  • Validity (the feedback loop is helping you get closer to the output you actually care about)

  • Reliability (the feedback loop consistently returns similar discrepancies in response to you taking similar actions)

  • Detail (the feedback loop gives you a large amount of information about the difference between current and desired output)

  • Saliency (the feedback loop delivers attentionally or motivationally salient feedback)
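If it helps to make the list concrete, here is a minimal sketch in Python of how you might jot down a rough rating for each attribute and see where a feedback loop is weakest. The 0-2 rating scale, the class and function names, and the example ratings are all my own illustrative assumptions, not part of the method itself.

```python
from dataclasses import dataclass, fields


@dataclass
class FeedbackLoop:
    """Subjective ratings for each attribute: 0 = poor, 1 = mixed, 2 = strong."""
    speed: int
    frequency: int
    validity: int
    reliability: int
    detail: int
    saliency: int


def weak_points(loop: FeedbackLoop) -> list[str]:
    """Return the names of the attributes rated 0 (poor)."""
    return [f.name for f in fields(loop) if getattr(loop, f.name) == 0]


# Hypothetical ratings for someone who only learns years later
# whether a long-range prediction came true.
long_range_forecasting = FeedbackLoop(
    speed=0, frequency=0, validity=2, reliability=1, detail=0, saliency=1
)
print(weak_points(long_range_forecasting))
```

The ratings are judgment calls, so treat the output as a prompt for follow-up questions rather than a score.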

Examples

{You want to predict technology timelines}

Julie and Kate both claim to be experts in technological forecasting. When you ask Julie how she calibrates her predictions, she replies, “Mainly, I just have a sense for these sorts of things. But I also do things like monitor Google Trends, read lots of articles on technology, and ask lots of people what they think will happen. I’ve been doing this for 20 years.” She then points to a number of successful predictions she’s made. When you ask the same question to Kate, she replies, “Well, in the short term, it’s been shown that linear models of technological progress are the best, so I tend to use those to calibrate on the timespan of 1-3 years. If I make longer term predictions, I try to tell as many stories as possible for how those predictions may be false. Then I try to make careful arguments that rule out these stories. Furthermore, I always check whether my predictions diverge substantially from other technological forecasters. If they do, I try to figure out why. I’ve also identified a number of technological forecasters who have consistently good track records, and I study their methods, evidence, and predictions carefully. Finally, whenever one of my predictions turns out to be false, I spend about a week figuring out whether there is any general principle to be learned to guard against being wrong in the future.” Who do you think has the better marker of expertise? Why?

--

Technological forecasting is a domain in which reality doesn’t provide strong feedback, so you need proxy feedback. Julie does not have good proxy feedback while Kate does have relatively decent proxy feedback methods. Barring special information about Julie, Kate’s predictions are likely to be more reliable, all else equal.

--

{You want to choose a piano teacher}

Both Ned and Megan are piano teachers. Of the two, Ned is a much better pianist, having won many awards and played at Carnegie Hall many times. You ask both Ned and Megan how they can tell whether their teaching is working for a given student. Ned replies that he simply looks at the outcomes: if a student practices under him for several years, they become much better. “Basically, I show them how to play scales and pieces well, and then I check in about once every other week to make sure they are practicing the drills I showed them.” Megan replies with a detailed set of ways she can note rate of progress and how she adjusts her teaching accordingly. “For example, I know whether a student has ‘chunked’ a given chord through the following method: I stand behind the piano and quickly turn around a piece of paper with a chord on it and time how many milliseconds it takes for a student to react and play the chord. Also, each week I ask them to honestly report on whether they feel as if the chord is still a series of notes or whether it feels more like ‘one note.’ This indicates that the chord has become a ‘gestalt’ in the student’s mind. Another example: whenever a student makes an error while playing a piece, I mark the corresponding area in the sheet music. Eventually, I can then tell what types of errors a student generally makes by analyzing the darkest areas on various pieces, that is, the places with the most pen marks.” Megan continues to tell you similar examples. Who do you think has the better marker of expertise? Why?

--

In this case, while Ned may be the better pianist, he may not be the better expert at teaching piano. He seems to lack the feedback loops that would tell him whether his teaching is working. While he notes that his students improve over time, he is not entertaining the possibility that they would have improved over time even without his intervention.

--

{You want to hire a manager}

Both Todd and Greg have applied for a manager position at your organization. You ask each of them about their process for monitoring the rate at which their teams are making progress on goals. Todd: “I have everyone on a system where I can monitor the number of Pomodoros each person is completing. If certain team members are lagging behind in their number of Pomodoros, I give them a pep talk, after which the number tends to go back up.” Greg: “I have each team member set daily subgoals. Then I look at two things: (a) whether these subgoals tend to align to the broader goals and (b) whether they are achieving the subgoals they set for themselves. If a team member is lagging behind in (a) or (b), I give them a pep talk, after which they tend to perform better.” Who do you think has the better marker of expertise? Why?

--

In this case, both Todd and Greg have decent feedback loops. However, Todd’s feedback loop is more likely to fall victim to Goodhart’s law. In other words, though his method might be high in reliability, the measure (Pomodoro-maximization) might accidentally become the target, even though the intended target is goal completion. Greg’s feedback loop is higher in validity, in that it measures the target he actually cares about more tightly.

--
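To see how a measure and a target can come apart in the Todd and Greg example, here is a toy sketch in Python. The team members and all of the numbers are invented purely for illustration; they show that the two rankings can diverge, not that they do in any real team.

```python
# Hypothetical weekly numbers for three team members: Pomodoros completed
# (the proxy Todd watches) and goal-aligned subgoals achieved (what Greg watches).
team = {
    "Ana": {"pomodoros": 60, "aligned_subgoals_done": 3},
    "Ben": {"pomodoros": 35, "aligned_subgoals_done": 9},
    "Cy": {"pomodoros": 45, "aligned_subgoals_done": 6},
}


def rank_by(metric: str) -> list[str]:
    """Rank team members from highest to lowest on the given metric."""
    return sorted(team, key=lambda name: team[name][metric], reverse=True)


print(rank_by("pomodoros"))              # ranking under Todd's measure
print(rank_by("aligned_subgoals_done"))  # ranking under Greg's measure
```

With these made-up numbers the two rankings are exact opposites, so a pep talk aimed at the lowest Pomodoro count would go to the person making the most progress on goals.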

Ways of assessing

I. Ask questions which will reveal the details of their feedback loops (and whether they have them), such as:

a. “Let’s say I’m already a proficient coder, but I want to learn how to code at the level of a master. What sorts of problems might I practice on to move from proficiency to mastery? Are there any textbooks I should read?” (For a software engineer)

b. “In what ways do people typically stumble when they try to improve at data analysis?” (For a data analyst)

c. “How do you tell whether a marketing campaign is working?” (For a professional marketer)

d. “Can you tell me a bit about how you learn?”

II. Find out whether they’ve been part of a job, program, or mentorship that would have given them strong feedback loops.

III. Sometimes, people with tacit expertise will not be able to articulate their feedback loops. Analyze whether reality provides robust feedback in their domain. For example, a bike-rider might not be able to describe the feedback loops through which they learned bike-riding. However, reality automatically provides feedback in the domain by causing novice bike-riders to fall over, until they accumulate enough procedural knowledge to balance on two wheels.

T: Time spent on the above

This one is the most straightforward of all the necessary conditions for expertise. (Thus, I won’t go into much detail.) Simply: An expert needs to have spent enough time processing and interacting with the relevant data with robust feedback loops.

Ask: Has this expert put a plausibly sufficient amount of time into learning or using the skill in order to gain expertise?

For some skills, like using a spoon, there is a short latency between beginnerhood and expertise. For others, like having well-calibrated political views, there is quite a long latency. Accordingly, you can probably trust the average claim about spoon-use and should be suspicious of the average claim about politics.
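The “necessary but not sufficient” logic of the four conditions can be sketched as a simple checklist in Python. This is only an illustration under my own assumptions: the class name, field names, and wording of the results are hypothetical, and deciding whether each condition is met still takes the kind of questioning described above.

```python
from dataclasses import dataclass


@dataclass
class PiftCheck:
    processing: bool   # P: performs detailed mental operations on relevant data
    interaction: bool  # I: regularly engages relevant information sources
    feedback: bool     # F: has feedback loops (or good proxies) on relevant metrics
    time: bool         # T: has spent a plausibly sufficient amount of time on the above


def assess(check: PiftCheck) -> str:
    """Failing any condition rules expertise out; passing all four does not prove it."""
    missing = [name for name, met in vars(check).items() if not met]
    if missing:
        return "not an expert (missing: " + ", ".join(missing) + ")"
    return "possibly an expert (necessary conditions met, not sufficient)"


# Hypothetical candidate who meets every condition except feedback.
print(assess(PiftCheck(processing=True, interaction=True, feedback=False, time=True)))
```

Note that the best possible result is “possibly an expert”: meeting all four conditions only means the person has not been ruled out.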


There you have it: the PIFT method for assessing basic conditions for expertise. Here is what the underlying model looks like:

[Figure: The PIL model]

One easy way to remember the method: “If the person claims to be an expert and is not, say, ‘pift!’”

If it seems that I have misidentified or failed to identify a necessary condition for expertise, please let me know!