Takeaways from EAF’s Hiring Round

We (the Effective Altruism Foundation) recently ran a hiring round for two positions, an Operations Analyst and a Research Analyst, for which purpose we overhauled our application process. Since we’re happy with the outcome ourselves and received positive feedback from applicants about how we handled the hiring round, we want to share our experience. This post might be useful for other organizations and (future) applicants in the community.

Summary

  • The most important channels for finding promising candidates were our own networks and the EA community.

  • Encouraging specific individuals to apply was crucial.

  • Having an application period of only two weeks is possible, but not ideal.

  • We’ve become more convinced that a general mental ability test, work tests relevant for the role, a trial (week), and reference checks should be important parts of our application process.

  • The application form was more useful than a cover letter as a first filter. In the future, we will likely use a very brief form consisting of three to five questions.

  • It’s less clear to us that interviews added a lot of value, but that might have been due to the way we conducted them.

  • It’s hard to say whether blinding submissions and using quantitative measures in fact led to a fairer process. They definitely increased the focus and efficiency of our discussions. Given that it didn’t cost us much extra time, we will likely continue this practice in the future.

Recommendations for applicants

  • If you’re uncertain about applying, approach the organization and find out more.

  • When in doubt, apply! We had to convince two of the four candidates who made it to the trial to apply in the first place.

  • Not all effective altruism organizations are alike. Familiarize yourself as much as possible with the mission, priorities, and general practices of the organization you’re applying to.

  • Research positions: Practice your skills by researching relevant topics independently. Publish your write-ups in a relevant channel (e.g. on the EA Forum or your personal blog). Collect as much radically honest feedback as possible and try to incorporate it.

  • Operations positions: Read the 80,000 Hours post on operations roles. Take the lead on one or more projects (e.g. website, event). Collect feedback and try to incorporate it.

Deciding whether to hire

When it comes to hiring for an early-stage nonprofit like ours, common startup advice likely applies.[1] This is also backed up to some extent by our own experience:

  • Hire slowly, and only when you absolutely need to.

  • If you don’t find a great candidate, don’t hire.

  • Compromising on these points will mean achieving less due to the added costs in management capacity.

In this particular case, we realized that our philanthropy strategy lacked a grantmaking research team, which implied a fairly clear need for an additional full-time equivalent. We also systematically analyzed the important operations tasks that we weren’t getting done and concluded that we had about 1.5 full-time equivalents’ worth of extra work in operations.

Goals of the application process

  • Find the most suited candidate(s) for each role.

    • Make sure that people who are plausibly among the best candidates actually apply.

    • Make sure that your filters test for the right characteristics.

  • Make sure that applicants have a good experience and can take away useful lessons.

    • Minimize unnecessary time investment while learning as much as possible about each other.

    • Communicate details about the application process as clearly and early as possible.

    • Compensate candidates who invest a lot of time, irrespective of whether they get the job or not.

    • Give feedback to rejected candidates if they ask.

    • Write this post.

The application process in brief

Our application process consisted of four stages. These were, in chronological order:

  1. initial application (application form, general mental ability (GMA) test, CV);

  2. work test(s);

  3. two 45-minute interviews;

  4. trial week and reference checks.

Outcome

Within a two-week application period, we received a total of 66 applications. Six weeks later, we offered positions to two candidates and offered an extended trial to another candidate. The application process left us with a very good understanding of the candidates’ strengths and weaknesses and a conviction that they’re both excellent fits for their respective roles. Throughout the process, we stuck to a very aggressive timetable and were able to move quickly, while still gathering a lot of useful information.

The entire process took about two weeks of preparation and about two months to run (including the two-week application period). We invested about 400 hours, mainly across three staff members, which amounts to about 30% of their time during that period.[2] This was a good use of our time, both in expectation and in retrospect. If your team has less time available to run an application process, we would probably recommend doing fewer interviews.

Table: Overview of the different application stages

We did not expect to filter so strongly during the first stage, but the median candidate turned out to be a worse fit than we expected. The best applications were better than we expected, however. The fact that the gender ratio stayed fairly constant throughout the stages is some evidence that we were successful in curbing potential gender biases (see the “Equal opportunity” section). However, it’s hard to say since we don’t have a control group. Since the time cost was limited, we will likely keep these measures in place. In the future, we’d like to reduce the time candidates have to spend on the initial application in order to encourage more candidates to apply and avoid unnecessarily taking up applicants’ time (also see the section “First stage: Initial application”).

How we came up with the application process

Scientific literature

We mainly relied on “The Validity and Utility of Selection Methods in Personnel Psychology: Practical and Theoretical Implications of 100 Years of Research Findings” by Schmidt (specifically, the 2016 update to his 1998 meta-analysis on this issue). We also considered Handbook of Principles of Organizational Behavior: Indispensable Knowledge for Evidence-Based Management by Edwin Locke. Both seemed to agree that general mental ability (GMA) tests were the best predictors of job performance. Indicators of conscientiousness (either integrity tests or conscientiousness measures) also fared well, as did interviews (either structured or unstructured). Interestingly and surprisingly, work tests were the best predictor in the 1998 meta-analysis, but did far worse in the 2016 one. Overall, this led us to include a GMA test in the process, increased our confidence in conducting interviews, decreased our confidence in work tests, and led us to try to measure conscientiousness (in a way that was not easily gameable).

Other EA organizations

We also considered how other EA organizations handled this challenge, particularly Open Phil and FHI, as they seem most similar to us in some relevant ways. From what we knew, they relied strongly on work tests and a trial (in the case of Open Phil), which updated us toward including a work test and running some form of trial.

Our own reasoning

Thinking about our particular situation and past experiences, as well as common-sense heuristics, made a few things stand out:

  • Having a thorough understanding of our mission is crucial for prioritizing correctly, making trade-offs, and coming up with useful ideas. We thought it was important that candidates display some of this already.

  • Although work tests are not among the best instruments in the meta-analysis, whether candidates excel at tasks they would take on if we hired them is likely predictive of their future performance in that role. The difficulty seems to be in making work tests representative of future tasks. Still, this gave us further reason to include a work test fairly early and also to do an in-person trial week.

  • For small teams, culture fit also struck us as particularly important. You have to get along in order to perform well and be as efficient as possible. Small teams in particular have limited slack for tolerating friction between team members, as we’ve also experienced in the past. This consideration spoke in favor of interviews and an in-person trial.

Equal opportunity

We wanted to make the process as fair as possible to all applicants. Research suggests that certain groups in particular, e.g. women and ethnic minorities, are less likely to apply and are more likely to be filtered out for irrelevant reasons. We tried to correct for potential biases by

  • encouraging specific individuals to apply,

  • making sure the job ad was inclusive,

  • defining criteria and good answers in advance,

  • blinding submissions wherever possible [3], and

  • introducing quantitative measures wherever possible.

We thought quantitative measures were particularly relevant for squishy components like “culture fit”, which might introduce biases late in the process when blinding is no longer possible.

There is another reason for such measures in a small community like effective altruism: it was very likely that we would evaluate people who we already knew and had formed opinions on, both positive and negative, that would affect our judgment in ways that would be very hard to account for.

The application process in detail

Writing the job ad

When drafting the two job ads (Operations Analyst, Research Analyst), we tried to incorporate standard advice for ensuring a diverse applicant pool. We ran the writing through a “gender decoder” (textio.com, gender-decoder), cut down the job requirements to what we considered the minimum, and added an equal opportunity statement. For better planning, increased accountability, and as a commitment device, we made the entire process transparent and gave as detailed a timeline as possible.

What we learned [4]

  • We would have liked to extend the application period from 2 to 8 weeks. This was not possible due to other deadlines at the organizational level.

  • We should have reduced barriers to approaching us for people who were uncertain whether to apply. In the future, we might add an explicit encouragement to the job ad or host an AMA session in some form during the application period.

  • After some initial feedback, we added an expected salary range to the job ads and will do so from the beginning in the future.

  • We should have communicated the required time investment for the work tests more clearly in the job ad; as it turned out, some ambiguous formulations implied that the work test would take four days to complete rather than two hours.

  • We have become less convinced of the value of generic equal opportunity statements.

Spreading the word

We shared the job ads mostly within the EA community and a few other Facebook groups, e.g. a group for scholarship recipients in Germany. To make sure the best candidates applied, we reached out to individuals we thought might be a particularly good fit and encouraged them to apply. For that, we relied on our own personal networks, as well as the networks of CEA and 80,000 Hours, both of which were a great help. We specifically tried to identify people from groups currently underrepresented at EAF.

What we learned

Table: Share of channels through which applicants learned about the positions, across stages of the application process

The most important channels for finding promising candidates were reaching out to people in our own network, CEA’s and 80,000 Hours’ networks, and the online EA community.

First stage: Initial application

Initially, we asked candidates for either role to (1) submit their CV, (2) fill out an application form, and (3) complete a general mental ability (GMA) test [5]. The purpose of this stage was to test their basic mission understanding and basic general competence.

The application form consisted of 13 long-form questions (with a limit of 1,000 characters per question). Three of these asked for understanding of and identification with our mission, while ten were so-called “behavioral questions”, i.e. questions about past behavior and performance (see Appendix B for all questions). We also collected all necessary logistical info. To some extent, this form was a substitute for a standardized interview, to reduce our time investment.

The GMA test was Raven’s Advanced Progressive Matrices (administered by Pearson Clinical). We picked it because it’s fairly well studied and understood, has a high g-loading (.80), only takes about 40 minutes to complete, and contains no components that could introduce linguistic or cultural biases. It does have the drawback that it only measures one form of intelligence; however, we deemed this acceptable for our purposes, given the high g-loading and short test time.

Evaluation. Two people from our team scored each form answer from 0 to 5 and each CV from −1 to +1. To eliminate potential biases, we blinded the submissions, scored question by question instead of candidate by candidate, and randomized the candidate ordering after each question. We then aggregated the scores from all three components (CV, form, test) after converting them to a standard score. Based on the resulting distribution and a closer look at edge cases, we decided which of the applicants to advance to the next stage. We only unblinded the data after having made an initial decision. After unblinding, we only allowed ourselves to make adjustments upwards, i.e. to allow more (rather than fewer) candidates to advance to the next stage. In the end, we didn’t make any changes after unblinding.
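
As a rough illustration, here is a minimal sketch in Python of this kind of standard-score aggregation. The candidate data, column names, and equal weighting of the three components are assumptions made up for the example; the post does not specify EAF’s actual numbers or weights.

```python
import pandas as pd

# Hypothetical per-candidate scores; values, column names, and the equal
# weighting below are illustrative assumptions, not EAF's actual data.
scores = pd.DataFrame(
    {
        "candidate": ["A", "B", "C", "D"],
        "cv": [1, 0, -1, 1],        # scored from -1 to +1
        "form": [48, 35, 52, 40],   # sum of the 0-5 scores across 13 questions
        "gma": [27, 31, 22, 29],    # raw GMA test score
    }
).set_index("candidate")

# Convert each component to a standard score (z-score) so that CV, form,
# and GMA are on a comparable scale before aggregating.
standardized = (scores - scores.mean()) / scores.std(ddof=0)

# Aggregate with equal weights; a weighted aggregation would be a dot product.
total = standardized.sum(axis=1)

print(total.sort_values(ascending=False))
```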

What we learned

  • The initial application form was too long. We had originally planned a shorter version for the initial stage and a longer one for a later stage. However, we decided to combine them because we faced some very tight deadlines at the organizational level. Given the circumstances, we probably made the right call, but we would change it if we had more time.

  • The application form and the GMA test seemed to measure independent aspects of a candidate (see Appendix A). We will likely continue to use some version of both in the future.

  • The GMA test seemed to be a better predictor of good performance in later stages than the application form. The form seemed particularly noisy for Research Analysts. However, the sample size is fairly low and there could be threshold effects (see Appendix A).

  • The most useful questions from the form were the following (see Appendix A):

    • Question 1: What is your current plan for improving the world?

    • Question 12: What do you think is the likelihood that at least one minister/secretary or head of state will mention the term “effective altruism” in at least one public statement over the next five years? Please give an estimate without consulting outside sources. How did you arrive at the estimate?

    • Question 13: Please describe a time when you changed something about your behavior as a result of feedback that you received from others.

  • Minor issues:

    • We should blind all material ourselves instead of asking applicants to do so.

    • We should tell applicants that the GMA test becomes progressively more difficult.

Second stage: Work test(s)

The work test(s) were designed to test the specific skills required for each role. Candidates who completed this stage received monetary compensation.

Operations Analyst. Candidates had to complete an assignment with seven subtasks within 120 minutes. In order to validate the test, uncover potential problems, and set a benchmark, one team member in a similar role completed the test in advance. We would have liked to test it more, but as far as we can tell, this did not turn out to be a problem.

Research Analyst. Candidates had to complete two assignments; one was limited to 120 minutes, the other was untimed, but we suggested how much to write and how much time to spend on each task. We decided to have a timed and an untimed test because we saw benefits and drawbacks to both and wanted to gather more information. For the untimed assignment, we tried to make sure that additional time invested beyond two hours would have strongly diminishing returns by reducing the total amount of work required. We asked two external people to test both assignments in advance.

Evaluation. The candidates’ potential supervisor evaluated each submission (blinded to the candidate’s identity) by scoring subtasks horizontally. Somebody else evaluated each submission as a whole so that we could spot potential flaws in the evaluation. We calculated a weighted sum of the scores from stages 1 and 2, which informed our decision about whom to advance.
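
As a toy sketch of that last step, the weighted combination might look like the following; the 40/60 weighting and all numbers are assumptions invented for illustration, not the weights EAF actually used.

```python
# Hypothetical standardized stage 1 totals and raw work-test scores.
stage1 = {"A": 0.8, "B": 1.4, "C": -0.3}
work_test = {"A": 4.2, "B": 3.1, "C": 4.5}

def standardize(values):
    """Convert raw scores to z-scores across the candidate pool."""
    mean = sum(values.values()) / len(values)
    std = (sum((v - mean) ** 2 for v in values.values()) / len(values)) ** 0.5
    return {k: (v - mean) / std for k, v in values.items()}

stage2 = standardize(work_test)

# Weighted sum of the two stages, informing who advances to the interviews.
combined = {c: 0.4 * stage1[c] + 0.6 * stage2[c] for c in stage1}

for candidate, score in sorted(combined.items(), key=lambda kv: -kv[1]):
    print(candidate, round(score, 2))
```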

What we learned

  • The work test(s) gave us a lot of useful information above and beyond what we had already gathered during the first stage. A cost-effective way to gain (part of) this information sooner could be to ask future applicants for a work/research sample (from a previous project) as part of the initial application.

  • We’re leaning toward only including timed tests in the future, as we did not see much evidence of terrible submissions as a result of time pressure, which was our main concern with timed tests, especially for the Research Analyst position.

  • We realized that the task load only allowed for satisficing (as opposed to excellence). In retrospect, however, it’s also valuable to test candidates’ ability to optimize, so we’re contemplating reducing the task load on one of the tests.

  • In some cases, we should have put more effort into clarifying expectations, i.e. the requirements for a very good submission. This was difficult to anticipate, and more testing might have improved the work tests in this regard.

Third stage: Interviews

We wanted to use the interviews to resolve any uncertainties about the candidates we had at that point, to refine our assessment of the extent to which they identified with our mission, to learn more about their fit for the specific role, and to get a first sense of their fit with our team. We decided to have two 45-minute structured interviews per candidate, back to back: one focused on EAF as an organization, one focused on the specific role.

Each interviewer scored the candidate across multiple criteria, and we calculated an aggregate score. Afterwards, we discussed all remaining candidates extensively before making a decision. Given the tight schedule, it was not possible to make sure that everybody on the hiring committee had the exact same information, as would be best practice. However, to resolve some uncertainties, we had people listen to some of the interview recordings (which we had made with the candidates’ consent and deleted right afterwards).

What we learned

  • In retrospect, we think the interviews did not give us a lot of additional information. We had already covered basic behavioral questions and mission understanding in the application form and tested for role fit with the work test. This might be less of a concern if we only use a trimmed-down version of the application form.

  • However, we’re also contemplating changing the format of the interview entirely to test components which other formats cannot capture well, such as independent thinking, productive communication style, and real-time reasoning.

Fourth stage: Trial week

We invited all candidates who made it past the interviews to a trial week in our Berlin offices (one person completed the trial from a remote location [6]). We wanted to use that time to test what it was like to actually work with them, get more data on their work output, test their fit with the entire team, and also allow them to get to know us better. They worked with their potential supervisor on two research projects (Research Analyst) or five operations projects (Operations Analyst). We also scheduled individual meetings with each staff member throughout the week. During the same time, we contacted references to learn more about the candidates.

We made a final decision after extensive deliberations in the following week, discussing each candidate in a lot of detail. We also asked all staff members about their impressions and thoughts, and calculated a final score which corroborated our qualitative assessment.

What we learned

  • Overall, the trial week was very useful and we accomplished what we had planned.

  • The reference checks were very helpful, and we’re considering conducting them earlier in the process (see Appendix C for our reference call checklist).

  • Scheduling 1-on-1 meetings with all team members was a good idea for an organization of our size. In the future, we’ll try to batch them on days 2 and 3 to get everybody acquainted more quickly.

  • Some tasks were underspecified, and we should have invested more time to clarify context and expectations. We could have spotted some of these problems earlier if we had organized even closer supervision in some cases. This proved particularly difficult for the remote trial.

  • We’re also considering a longer trial to better represent actual working conditions. This is a difficult trade-off, both for us as an organization and for the applicant. One solution could be to ask each candidate whether they want to complete a one-week or a four-week trial.

Appendix A. Shallow data analysis

We ran correlations of the scores from the first three stages, which you can access here. You can find the questions that we used in the form on the far right of the spreadsheet or in Appendix B. After the first stage, we had to split scores between applicants for the Operations Analyst position and the Research Analyst position. Due to the small sample size, we’re not confident in any conclusions and don’t discuss any correlations beyond the second stage. We should also note that we cannot tell how well any of the measures correlate with actual work performance.
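
For reference, this kind of shallow correlation analysis takes only a few lines of pandas. The file and column names below are placeholders standing in for the linked spreadsheet, not its actual structure.

```python
import pandas as pd

# Placeholder file and column names (one row per candidate, one column per
# score); the real data lives in the spreadsheet linked above.
df = pd.read_csv("stage_scores.csv")

# Pairwise Pearson correlations between individual questions, the component
# scores (form, GMA, CV), and the stage 1 total.
columns = ["q1", "q2", "q12", "q13", "form_total", "gma", "cv", "stage1_total"]
print(df[columns].corr().round(2))
```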

First stage

  • High scores on questions 1, 2, 12, and 3 each predicted the total form score fairly well (correlations of .77, .69, .69, and .62). (Note that questions 1, 2, and 3 were weighted slightly more than the other questions in the total form score.) No question was negatively correlated with the form score, which is a good sign. (Note also that the questions themselves contribute to the total form score, which likely inflates these correlations.)

  • Question 12 was not only a good predictor of the form score, but also had the highest correlation of all questions with both the GMA score (.43) and the CV score (.39).

  • The correlation between the form score and the GMA score was fairly modest at .28, which indicates that these indeed measure different aspects. The same holds for the correlations between the form score and the CV score (.33) as well as between the GMA score and the CV score (.29).

  • Question 12 was also highly correlated with the total score of the first stage (.73), even though it was just one of the 13 questions making up the form score (and was even among the lower-weighted questions). Other highly correlated questions were again 1 (.63), 2 (.49), and 13 (.49).

  • Question 8 was negatively correlated with both the GMA score (-.34) and the total score (-.14).

Second stage

Research Analyst:

  • Doing well during the first stage did not correlate with a high score on the work test (-.06). On one hand, this is good because we did intend to measure different things. On the other hand, we wanted to select for excellent performance on the job in the first stage, and this didn’t show up in the work test at all. We’re not sure if this is picking up on something real, since the sample size is low and there could be threshold effects, but we would have hoped for a slightly higher correlation all things considered.

  • We observed somewhat strong positive correlations with the work test for questions 5 (.36) and 11 (.39), as well as for the GMA score (.39). Negative correlations held for questions 2 (-.53), 6 (-.74), 8 (-.42), and the total form score (-.51).

Operations Analyst:

  • For this role, the total score from the first stage was well correlated with the score on the work test (.57).

  • Not only was the GMA score correlated with the work test (.58), as was the case for the Research Analysts, but so were the form score (.59) and almost all subquestions. The only notable negative correlations were for question 4 (-.57) and question 7 (-.33).

Appendix B. Application form

  • Question 1: What is your current plan for improving the world?

  • Question 2: If you were given one million dollars, what would you do with them? Why?

  • Question 3: What excites you about our mission?

  • Question 4: Briefly describe an important accomplishment you achieved (mainly) on your own, and one you achieved together with a team. For the latter, what was your role?

  • Question 5: Briefly describe a time when you had to finish a task you did not enjoy. How did you manage the situation? What was the result?

  • Question 6: Briefly give two examples where you deliberately improved yourself.

  • Question 7: Can you describe an instance when you deliberately took the initiative to improve the team/organization you were working with?

  • Question 8: There are times when we work without close supervision or support to get the job done. Please tell us about a time when you found yourself in such a situation and how things turned out.

  • Question 9: Please tell us about a time when you successfully devised a clever solution to a difficult problem or set of problems, or otherwise “hacked” your way to success.

  • Question 10: While we try to make sure that workload and stress are kept at sustainable levels, there are times when having to perform under pressure at work is unavoidable. How do you cope with stressful situations?

  • Question 11: What was a high-stakes choice you faced recently? How did you arrive at a decision?

  • Question 12: What do you think is the likelihood that at least one minister/secretary or head of state will mention the term “effective altruism” in at least one public statement over the next five years? Please give an estimate without consulting outside sources. How did you arrive at the estimate?

  • Question 13: Please describe a time when you changed something about your behavior as a result of feedback that you received from others.

Appendix C. Template for reference checks

  • Introduce yourself.

  • Give some background on your call.

    • [candidate] applied for [position] with us. Currently in [stage of the application process].

    • [description of the types of work they’d be doing in the role]

    • We will treat all information confidentially.

  • In what context did you work with the person?

    • What was the concrete project/task?

  • What were their biggest strengths?

  • What were their main areas for improvement back then? Was there anything they struggled with?

  • How would you rate their overall performance in that job on a 1-10 scale? Why?

  • Would you say they are among the top 5% of people who you ever worked with at your organization?

  • If you were me, would you have any concerns about hiring them?

  • Is there anything else you would like to add?

  • If I have further questions that come up during the trial week, may I get in touch again?

Endnotes

[1] Kerry Vaughan pointed out to us afterwards that there are also reasons to think the common startup advice might not fully apply to EA organizations: 1) The typical startup might be more funding-constrained than EAF, and 2) EAs have more of a shared culture and shared knowledge than typical startup employees and, therefore, might require less senior management time for onboarding new staff. This might mean that if we can hire people who already have a very good sense of what needs to be done at the organization, it could be good to hire even when there isn’t a strong need.

[2] We would expect this to be a lot less for future hiring rounds (about 250 hours) since this also includes initial research, the design of the process, and putting in place all the necessary tools.

[3] Julia Wise pointed out to us afterwards that there is evidence that blinding may lead to fewer women being hired. Given that our main goal was making the process as fair as possible to all candidates, this did not update us against using blinding in the future. EDIT: After talking and thinking more about this issue, we now think this does provide us with some reason not to blind in the future.

[4] Points in these sections are based on a qualitative internal evaluation, the data analysis in Appendix A, direct feedback from applicants, and an anonymous feedback form we included in the job ad and sent to all candidates after they had been rejected.

[5] We checked with a lawyer whether this was within our rights as an employer and concluded that it was permissible under German law. We don’t have an opinion on whether this would have been permissible in another jurisdiction. For what it’s worth, the relevant standard in the US for deciding whether administering a particular pre-employment test is legal seems to be “job-relatedness”.

[6] We already had a good sense of their team fit and organized Skype calls with some of the team members who were less familiar with them. Otherwise, we would likely not have allowed for a remote trial.