Mostly bio, occasionally forecasting/epistemics, sometimes stats/medicine, too often invective.
Gregory Lewis
Thanks for the forest and funnel plots - much more accurate and informative than my own (although it seems the core upshots are unchanged).
I'll return to the second-order matters later in the show, but on the merits, surely the discovery of marked small study effects should call the results of this analysis (and the subsequent recommendation of StrongMinds) into doubt?
Specifically:
The marked small study effect is difficult to control for, but it seems my remark of an "integer division" re. effect size is in the right ballpark. I would expect* (more later) real effects 2x-4x lower than thought, which could change the bottom lines.
Heterogeneity remains vast, but the small study effect is likely a better predictor of it than time decay, intervention properties similar to StrongMinds, etc. It seems important to repeat the analysis controlling for small study effects, as the overall impact calculation is much more sensitive to coefficient estimates, which are plausibly confounded by this currently unaccounted-for effect.
Discovering that the surveyed studies appear riven with publication bias and p-hacking should prompt further scepticism of outliers (like the SM-specific studies heavily relied upon).
Re. each in turn:
1. I think the typical "Cochrane-esque" norms would say the pooled effects and metaregression results are essentially meaningless given profound heterogeneity and marked small study effects. From your other comments, I presume you favour more of a "Bayesian Best Guess" approach: rather than throwing up our hands if noise and bias loom large, we should do our best to correct for them and give the best estimate on the data.
In this spirit of statistical adventure, we could use the Egger's regression slope to infer the effect size a perfectly precise study would have (I agree with Briggs this is a dubious technique, but it seems one of the better available quantitative "best guesses"); a sketch is below. Reading your funnel plot, the limit value is around 0.15, ~4x lower than the random effects estimate. Your output suggests it is higher (0.26), which I guess is owed to a multilevel model rather than the simpler one in the forest and funnel plots, but either way it is ~2x lower than the previous "t=0" intercept values.
These are substantial corrections, and probably should be made urgently to the published analysis (given donors may be relying upon it for donation decisions).
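For concreteness, the sketch referred to above (a minimal version; the CSV export and column names like yi/vi are my assumptions about the spreadsheet, not HLI's actual code):

```r
library(metafor)

dat <- read.csv("psychotherapy_effects.csv")  # hypothetical export of the effect size spreadsheet
dat$sei <- sqrt(dat$vi)                       # standard error of each effect size

res <- rma(yi, vi, data = dat)                # random-effects pooled estimate
regtest(res)                                  # Egger-type test for funnel plot asymmetry

# Meta-regress the effect sizes on their standard errors: the model's prediction
# at SE = 0 is the "perfectly precise study" limit value discussed above
lim <- rma(yi, vi, mods = ~ sei, data = dat)
predict(lim, newmods = 0)
```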
2. As it looks like "study size" is the best predictor of heterogeneity so far discovered, there's a natural fear that the previous coefficient estimates for time decay and SM-intervention-like properties are confounded by it. So the overall correction to the calculated impact could be greater than a flat 50-75% discount, if the less resilient coefficients "go the wrong way" when this factor is controlled for. I would speculate adding it in would give a further discount, albeit a (relatively) mild one: it is plausible that study size collides with time decay (so controlling for it results in somewhat greater persistence), but I would suspect the SM-trait coefficients go down markedly, so the MR including them would no longer give ~80% larger effects.
Perhaps the natural thing would be to include study size/precision as a coefficient in the metaregressions (e.g. adding it on to model 5), and to use these coefficients (rather than the univariate analysis previously done for time decay) in the analysis (again, pace the health warnings Briggs would likely provide). Again, this seems a matter of some importance, given the material risk of upending the previously published analysis.
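A sketch of what "adding it on to model 5" could look like (the moderator names below are placeholders for whatever covariates model 5 actually uses; `sei` is the study standard error from the earlier sketch):

```r
library(metafor)

# Hypothetical moderator names standing in for model 5's actual covariates
mod5_plus_se <- rma(yi, vi,
                    mods = ~ followup_years + sm_trait + lay_delivered + sei,
                    data = dat)
summary(mod5_plus_se)  # time-decay and SM-trait coefficients, now adjusted for small study effects
```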
3. As perhaps goes without saying, seeing a lot of statistical evidence for publication bias and p-hacking in the literature probably should lead one to regard outliers with even greater suspicion - both because they are even greater outliers versus the (best guess) "real" average effect, and because the prior analysis gives an adverse prior on what is really driving the impressive results.
It is worth noting that the StrongMinds recommendation is surprisingly insensitive to the MR results, despite their comprising the bulk of the analysis. With the Guesstimate as-is, SM removes roughly 12 SDs (SD-years, I take it) of depression per $1k. When I set the effect sizes of the metaregressions to zero, the Guesstimate still spits out an estimate that SM removes 7.1 SDs per $1k (so roughly ~7x more effective than GiveDirectly). This suggests that the ~5 individual small studies are sufficient for the evaluation to give the nod to SM even if (e.g.) the meta-analysis found no impact of psychotherapy.
I take this to be diagnostic that the integration of information in the evaluation is not working as it should. Perhaps the Bayesian thing to do is to further discount these studies, given they are increasingly discordant from the (corrected) metaregression results, and their apparently high risk of bias given the literature they emerge from. There should surely be some non-negative value of the meta-analysis effect size which reverses the recommendation.
#
Back to the second-order stuff. I'd take this episode as a qualified defence of the "old-fashioned way of doing things". There are two benefits in aiming towards higher standards of rigour.
First, sometimes the conventions are valuable guard rails. Shortcuts may not just add expected noise, but add expected bias. Or, another way of looking at it, the evidential value of the work could be very concave with "study quality".
These things can be subtle. One example I haven't previously mentioned concerns inclusion: the sampling/extraction was incomplete. The first shortcut you took (i.e. culling references from prior meta-analyses) was a fair one - sure, there might be more data to find, but there's not much reason to think this would introduce directional selection with effect size.
Unfortunately, the second source - references from your attempts to survey the literature on the cost of psychotherapy - we would expect to be biased towards positive effects: the typical study here is a cost-effectiveness assessment, and such assessment is only relevant if the intervention is effective in the first place (if no effect, the cost-effectiveness is zero by definition). Such studies would be expected to ~uniformly report significant positive effects, and thus including this source biases the sample used in the analysis. (And hey, maybe a meta-regression doesn't find "from this source versus that one" to be a significant predictor, but if so I would attribute it more to the literature being so generally pathological than to cost-effectiveness studies being unbiased samples of effectiveness simpliciter.)
Second, following standard practice is a good way of demonstrating you have "nothing up your sleeve": that you didn't keep re-analysing until you found results you liked, or selectively report results to favour a pre-written bottom line. Although I appreciate this analysis was written before Simeon's critique, prior to this one may worry that HLI, given its organisational position on wellbeing etc., would really like to find an intervention that "beats" orthodox recommendations, and this could act as a finger on the scale of its assessments (cf. ACE's various shortcomings back in the day).
It is unfortunate that this analysis is not so much "avoiding even the appearance of impropriety" as "looking a bit sus". My experience so far has been that further investigation into something or other in the analysis typically reveals a shortcoming (and these shortcomings tend to point in the "favouring psychotherapy/SM" direction).
To give some examples:
That something is up with the data (i.e. huge heterogeneity, huge small study effects) can be seen in the forest plot (and definitely in the funnel plot). It is odd to skip these figures and this basic assessment before launching into a much more elaborate multi-level metaregression.
It is also odd to have an extensive discussion of publication bias (up to and including one's own attempt to make a rubric to correct for it) without doing the normal funnel plot +/- tests for small study effects.
Even if you didn't look for it, metareg in R will confront you with heterogeneity estimates for all your models in its output (cf.). One should naturally expect curiosity (or alarm) on finding >90% heterogeneity, which I suspect stays at or above 90% even with the most expansive meta-regressions. Not only are these not reported in the write-up, but in the R outputs provided (e.g. here) these parts of the results have been cropped out. [This was mistaken; mea maxima culpa.]
Mentioning prior sensitivity analyses which didn't make the cut for the write-up invites wondering what else got left in the file-drawer.
Thanks. I've taken the liberty of quickly meta-analysing (rather, quickly plugging your spreadsheet into metamar). I have further questions.
1. My forest plot (ignoring repeated measures - more later) shows studies with effect sizes >0 (i.e. disfavouring the intervention) and <-2 (i.e. greatly favouring the intervention). Yet fig 1 (and subsequent figures) suggests the effect sizes of the included studies are between 0 and -2. Appendix B also says the same: what am I missing?
2. My understanding is that it is an error to straightforwardly include multiple results from the same study (i.e. F/U at t1, t2, etc.) in a meta-analysis (see the Cochrane handbook here): naively, one would expect doing so to overweight these studies versus those which report outcomes only once. How did the analysis account for this?
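For what it's worth, one standard way of handling this (a sketch, not necessarily what the analysis did) is to nest effect sizes within studies in a multilevel model, so a study contributes a single study-level random effect however many follow-ups it reports; `study` and `es_id` are assumed identifier columns:

```r
library(metafor)

res_ml <- rma.mv(yi, vi,
                 random = ~ 1 | study / es_id,  # effect sizes nested within studies
                 data   = dat)
summary(res_ml)
```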
3. Are the meta-regression results fixed or random effects? I'm pretty sure metareg in R does random effects by default, but it is intuitively surprising you would get the impact halved if one medium-sized study is excluded (Baranov et al. 2020). Perhaps what is going on is that the overall calculated impact is much more sensitive to the regression coefficient for time decay than to the pooled effect size, so the lone study with longer follow-up exerts a lot of weight dragging this upwards.
4. On the external validity point, it is notable that Baranov et al. was a study of pre-natal psychotherapy in Pakistan: it looks dubious that the results of this study would really double our estimates of effect persistence - particularly for, as I understand it, more general provision in sub-Saharan Africa. There seem facially credible reasons why the effects in this population could be persistent in a non-generalising way: e.g. that better maternal mental health post-partum means better economic decision-making at a pivotal time (which then improves material circumstances thereafter).
In general, inclusion seems overly permissive: by analogy, it is akin to doing a meta-analysis of the efficacy of aspirin on all-cause mortality where you pool all of its indications, and are indifferent to whether it is mono-, primary, or adjunct Tx. I grant efficacy findings in one subgroup are informative re. efficacy in another, but not so informative that results can be weighed equally versus studies performed in the subgroup of interest (ditto including studies which only partly or tangentially involve any form of psychotherapy - inclusion looks dubious given the degree to which outcomes can be attributed to the intervention of interest is uncertain). Typical meta-analyses have much more stringent criteria (cf. PICO), and for good reason.
5. You elect for exponential decay over linear decay in part because the former model has a higher R² than the latter. What were the R²s? By visual inspection I guess both figures are pretty low. Similarly, it would be useful to report these or similar statistics for all of the metaregressions reported: if the residual heterogeneity remains very high, this supplies caution to the analysis: effects vary a lot, and we do not have good explanations why.
6. A general challenge here is that metaregression tends to be insensitive, and may struggle to ably disentangle between-study heterogeneity - especially when, as here, there's a pile of plausible confounds owed to the permissive inclusion criteria (e.g. besides clinical subpopulation, what about location?). This is particularly pressing if the overall results are sensitive to strong assumptions made about the presumptive drivers of said heterogeneity, given the high potential for unaccounted-for confounders distorting the true effects.
7. The write-up notes one potential confounder to apparent time decay: better studies have more extensive follow-up, but perhaps better studies also report smaller effects. It is unfortunate small study effects were not assessed, as these appear substantial:
Note both the marked asymmetry (Egger's P < 0.001), as well as the large number of intervention-favouring studies finding themselves in the P 0.01 to 0.05 band. Quantitative correction would be far from straightforward, but plausibly an integer divisor. It may also be worth controlling for this effect in the other metaregressions.
8. Given the analysis is atypical (re. inclusion, selection/search, analysis, etc.), "analysing as you go" is probably not the best way of managing researcher degrees of freedom. Although it is perhaps a little too late for a pre-specified analysis plan, a multiverse analysis could be informative.
I regret my hunch is this would find the presented analysis is pretty far out on the tail of "psychotherapy-favouring results": most other reasonable ways of slicing it lead to weaker or more uncertain conclusions.
I found (I think) the spreadsheet for the included studies here. I did a lazy replication (i.e. excluding duplicate follow-ups from studies, only including the 30 studies where "raw" means and SDs were extracted, then plugging this into metamar). I copy and paste the (random effects) forest plot and funnel plot below - doubtless you would be able to perform a much more rigorous replication.
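For transparency, this is roughly all the "lazy replication" amounts to (metamar does the equivalent behind a web form; the file name and column names are assumptions about the spreadsheet, not its actual layout):

```r
library(metafor)

dat <- read.csv("included_studies.csv")  # hypothetical export of the spreadsheet linked above
dat <- escalc(measure = "SMD",           # standardised mean differences from the raw means/SDs
              m1i = mean_tx,  sd1i = sd_tx,  n1i = n_tx,
              m2i = mean_ctl, sd2i = sd_ctl, n2i = n_ctl,
              data = dat)

res <- rma(yi, vi, data = dat)  # random-effects model
forest(res)                     # forest plot
funnel(res)                     # funnel plot
regtest(res)                    # Egger-type asymmetry test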
Re. the meta-analysis, are you using the regressions to get the pooled estimate? If so, how are the weights of the studies being pooled determined?
Per the LW discussion, I suspect you'd fare better spending effort actually presenting the object-level case rather than meta-level bulverism to explain why these ideas (whatever they are?) are getting a chilly reception.
Error theories along the lines of "Presuming I am right, why do people disagree with me?" are easy to come by. Suppose Landry's/your work is indeed a great advance in AI safety: then perhaps it is indeed being neglected thanks to collective epistemic vices in the AI safety community. Suppose instead this work is bunk: then perhaps epistemic vice on your part explains your confidence (complete with persecution narrative) in the work despite its lack of merit.
We could litigate which is more likely - or, better, find what the ideal "bar" insiders should have on when to look into outsider/heterodox/whatever work (too high, and existing consensus is too entrenched and you miss too many diamonds in the rough; too low, and expert time is squandered submerged in dross), and see whether what has been presented so far gets far enough along the ?crackpot/?genius spectrum to warrant the consultation and interpretive labour you assert you are rightly due.
This would be an improvement on the several posts so far just offering "here are some biases which we propose explain why our work is not recognised". Yet it would still largely miss the point: the "bar" for how receptive an expert community will be is largely a given, and seldom that amenable to protests from those currently screened out that it should be lowered. If the objective is to persuade this community to pay attention to your work, then whether in some platonic sense their bar is "too high" is neither here nor there: you still have to meet it, else they will keep ignoring you.
Taking your course of action instead has the opposite of the desired effect. The base rates here are not favourable, and extensive "rowing with the ref" whilst basically keeping the substantive details behind the curtain with a promissory note of "this is great, but you wouldn't understand its value unless you were willing to make arduous commitments to carefully study why we're right" is a further adverse indicator.
I'd guess the distinction would be more "public interest disclosure" rather than "officialness" (after all, a lot of whistleblowing ends up in the media because of inadequacy in "formal" channels). Or, with apologies to Yes Minister: "I give confidential briefings, you leak, he has been charged under section 2a of the Official Secrets Act".
The question seems to be one of proportionality: investigative or undercover journalists often completely betray the trust and (reasonable) expectations of privacy of their subjects/targets, and this can ethically vary from reprehensible to laudable depending on the value of what it uncovers (compare paparazzi to Panorama). Where this nets out for disclosing these Slack group messages is unclear to me.
Hello Luke,
I suspect you are right to say that no one has carefully thought through the details of medical career choice in low- and middle-income countries - I regret I certainly haven't. One challenge is that the particular details of medical careers will vary not only between higher- and lower-income countries but also within these groups: I would guess (e.g.) Kenya and the Philippines differ more than the US and UK. My excuse would be that I thought I'd write about what I knew, and that this would line up with the backgrounds of the expected audience. Maybe that was right in 2015, but it is much less so now, and (hopefully) will be clearly false in the near future.
Although I fear I'm little help in general, I can offer something more re. E2G vs. medical practice in Kenya. First, some miscellaneous remarks/health warnings on the "life saved" figure(s):
The effect size interval of "physician density" crosses zero (P value ~0.4(!)). So with more sceptical priors/practices you might take this as a negative result. E.g. I imagine a typical GiveWell analyst would interpret this work as an indication that training more doctors is not a promising intervention.
Both wealth and education factors are much more predictive, which is at least indicative (if not decisive) of what stands better prospects of moving the population health needle. This fits with general doctrine in public health around the social determinants of health, and rhymes with the typically unimpressive impacts of generally greater medical care/expenditure in lottery studies, RCTs, etc.
Ecological methods may be the best we (/I) have, but they are tricky, ditto the relatively small dataset and bunch of confounds. If I wanted to give my best-guess central estimate of the impact of a doctor, I would probably adjust down further due to likely residual confounding, probably by a factor of ~3. The most obvious example is that physician density likely proxies for healthcare workers generally, and doctors are unlikely to contribute the majority of the impact of a "marginal block of healthcare staff".
I typically think the best use of this work is something like an approximate upper bound: "When you control for the obvious, it is hard to see any impact of physicians in the aggregate - but it is unlikely to be much greater than X".
The "scaling" effect of how much the returns to physicians diminish as their density increases is a function of how the variables are linearized. Although this is indirectly data-driven (i.e. because the relationship is very non-linear, you linearise using a function which drives diminishing returns), it is not a "discovery" from the analysis itself.
Although the available data (and maybe reality) is much too underpowered to show this, I would guess this scaling overrates the direct impact of medical personnel in lower-income settings: advanced medical training is likely overkill for primary prevention (or sometimes typical treatment) of the main contributors to lower-income countries' burden of disease (e.g., for Kenya). If indeed the skill mix should be tilted away from highly trained staff like physicians in low-income settings versus higher-income ones, then there is less of an outsized effect of physician density.
Anyway, bracketing all the caveats and plugging Kenya's current physicians-per-capita figure into the old regression model gives a marginal response of ~40 DALYs, so a 15x multiplier versus the same for the UK. If one (very roughly) takes ~20-40 DALYs = 1 "life saved", each year of Kenyan medical practice roughly nets out to 5-10k USD of GiveWell donations.
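The back-of-envelope behind that figure, for anyone wanting to check it (the cost-per-life number is my assumption of roughly $4,000-5,000 per life saved for a GiveWell-style benchmark, not something from the regression):

```r
dalys_per_year <- 40          # marginal DALYs averted per year of Kenyan practice (from the regression)
dalys_per_life <- c(40, 20)   # rough range for DALYs per "life saved" equivalent
usd_per_life   <- 5000        # assumed GiveWell-style cost per life saved, USD

dalys_per_year / dalys_per_life * usd_per_life
# ~5,000 to ~10,000 USD-equivalent per year of practice
```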
As you note, this is >>10% (at the upper end, >100%) of the average income of someone in Kenya. However, I'd take the upshot as less "maybe medical careers are a good idea for folks in lower-income countries", and more "maybe E2G in lower-income countries is usually a bad idea", as (almost by definition) the opportunities to generate high incomes to support large donations to worthy causes will be scarcer.
Notably, the Kenyan diaspora in the US reports a median household income of ~$61,000, whilst the average income for a Kenyan physician is something like $35,000, so "E2G + emigration" likely ends up ahead. Of course "just move to a high-income country" is not some trivial undertaking, and is much easier said than done - but then again, the same applies to "just become a doctor".
Asserting (as Epicurean views do) that death is not bad (in itself) for the being that dies is one thing. Asserting (as the views under discussion do) that death (in itself) is good - and ongoing survival bad - for the being that dies is quite another.
Besides its divergence from virtually everyone's expressed beliefs and general behaviour, it doesn't seem to fare much better under deliberate reflection. For the sake of a less emotionally charged variant of Mathers' example, responses to Singer's shallow pond case along the lines of "I shouldn't step in, because my non-intervention is in the child's best interest: the normal life they could 'enjoy' if they survive accrues more suffering in expectation than their imminent drowning" appear deranged.
Cf. your update, I'd guess the second-order case should rely on things being bad rather than looking bad. The second-order case in the OP looks pretty slim, and little better than the direct EV case: it is facially risible that supporters of a losing candidate owe the winning candidate's campaign reparations for having the temerity to compete against them in the primary. The tone of this attempt to garner donations by talking down to these potential donors as if they were naughty children who should be ashamed of themselves for their political activity also doesn't help.
I'd guess strenuous primary contests within a party do harm the winning candidate's chances in the general (sort of like a much watered-down version of third-party candidates splitting the vote for D or R), but competitive primaries seem on balance neutral-to-good for political culture, thus competing in them when one has a fair chance of winning seems fair game.
It seems the key potential "norm violation you owe us for" is the significant out-of-state fundraising. If this was in some sense a "bug" in the political system, taking advantage of it would give Salinas and her supporters a legitimate beef (and would defray the potential hypocrisy of supporters of Salinas attacking Flynn in the primary for this yet subsequently hoping to solicit the same to benefit Salinas in the general - the latter is sought to "balance off" the former). This looks colorable but dubious by my lights: not least, nationwide efforts for both parties typically funnel masses of out-of-state support to candidates in particular election races, and a principled distinction between the two isn't apparent to me.
I agree this form of argument is very unconvincing. That "people don't act as if Y is true" is a pretty rubbish defeater for "people believe Y is true", and a very rubbish defeater for "X is true" simpliciter. But this argument isn't Ord's, but one of your own creation.
Again, the validity of the philosophical argument doesn't depend on how sincerely a belief is commonly held (or whether anyone believes it at all). The form is simply modus tollens:
If X (~sanctity of life from conception) then Y (natural embryo loss is, e.g., a much greater moral priority than HIV)
¬Y (natural embryo loss is not a much greater moral priority than (e.g.) HIV)
Therefore, ¬X (the sanctity-of-life-from-conception view is false)
Crucially, ¬Y is not motivated by interpreting supposed revealed preferences from behaviour. Besides that being ~irrelevant ("Person or group does not (really?) believe Y -->?? Y is false"), this apparent hypocrisy can be explained by ignorance rather than insincerity: it's not like statistics around natural embryo loss are common knowledge, so their inaction towards the Scourge could be owed to their being unaware of it.
¬Y is mainly motivated by appeals to Y's apparent absurdity. Ord (correctly) anticipates very few people on reflection would find Y plausible, and so they would find, if X indeed entailed Y, that this is a reason to doubt X. Again, it is the implausibility on rational reflection, not the concordance of practice among those who claim to believe it, which drives the argument.
Sure - I'm not claiming "EA doctrine" has no putative counter-examples which should lead us to doubt it. But these counter-examples should rely on beliefs about propositions, not assessments of behaviour: if EA says "it is better to do X than Y", yet this seems wrong, this is a reason to doubt EA, but whether anyone is actually doing X (or X instead of Y) is irrelevant. "EA doctrine" (ditto most other moral views) urges us to be much less selfish - that I am selfish anyway is not an argument against it.
I think this piece mostly misunderstands Ord's argument, through confusing reductios with revealed preferences. Although you quote the last sentence of the work in terms of revealed preferences, I think you get a better picture of Ord's main argument from his description of it:
The argument then, is as follows. The embryo has the same moral status as an adult human (the Claim). Medical studies show that more than 60% of all people are killed by spontaneous abortion (a biological fact). Therefore, spontaneous abortion is one of the most serious problems facing humanity, and we must do our utmost to investigate ways of preventing this death - even if this is to the detriment of other pressing issues (the Conclusion).
Note there's nothing here about hypocrisy, and the argument isn't "Ord wants us to interpret people's departure from their stated moral beliefs, not as moral failure or selfishness or myopia or sin, but as an argument against people's stated moral claims."
This wouldn't be much of an argument anyway: besides the Phil-101 points around "even if pro-lifers are hypocrites their (pretended) belief could still be true", it's still very weak as an abductive consideration. Even if pro-lifers' hypocrisy gives some evidence their (stated) beliefs are false (through a few mechanisms I'll spare elaborating), this counts for little unless the hypocrisy was of a remarkably greater degree than others'. As moral hypocrisy is all but universal, and merely showing (e.g.) that stereotypical Kantians sometimes lie, or that utilitarians give less to charity than they say they ought, is not much of a revelation, I doubt this (or the other extensions in the OP) bears much significance in terms of identifying particularly discrediting hypocrisy.
The challenge of the Scourge is that a common bioconservative belief ("The embryo has the same moral status as an adult human") may entail another which seems facially highly implausible ("Therefore, spontaneous abortion is one of the most serious problems facing humanity, and we must do our utmost to investigate ways of preventing this death - even if this is to the detriment of other pressing issues"). Many (most?) find the latter bizarre, so if they believed it was entailed by the bioconservative claim they would infer this claim must be false. Again, this reasoning is basically orthogonal to any putative hypocrisy among those asserting its truth: even if it were the case (e.g.) that the Catholic Church was monomaniacal in its efforts to combat natural embryo loss, the argument would still lead me to think they were mistaken.
Ord again:
One certainly could save the Claim by embracing the Conclusion, however I doubt that many of its supporters would want to do so. Instead, I suspect that they would either try to find some flaw in the argument, or abandon the Claim. Even if they were personally prepared to embrace the Conclusion, the Claim would lose much of its persuasive power. Many of the people they were trying to convince are likely to see the Conclusion as too bitter a pill, and to decide that if these embryo-related practices are wrong at all, it cannot be due to the embryo having full moral status.
The guiding principle I recommend is "disclose in the manner which maximally advantages good actors over bad actors". As you note, this will usually mean something between "public broadcast" and "keep it to yourself", and perhaps something in and around responsible disclosure in software engineering: try to get the message to those who can help mitigate the vulnerability without it leaking to those who might exploit it.
On how to actually do it, I mostly agree with Bloom's answer. One thing to add is that although I can't speak for OP staff, Esvelt, etc., I'd expect - like me - they would far rather have someone "pester" them with a mistaken worry than see a significant concern get widely disseminated because someone was too nervous to reach out to them directly.
Speaking for myself: if something comes up where you think I would be worth talking to, please do get in touch so we can arrange a further conversation. I don't need to know (and I would recommend against including) particular details in the first instance.
(As perhaps goes without saying, at least for bio - and perhaps elsewhere - I strongly recommend against people trying to generate hazards, "red teaming", etc.)
Yes, corrected. Thanks!
Thanks for this, Richard.
As you (and other commenters) note, another aspect of Pascalian probabilities is their subjectivity/ambiguity. Even if you can't (accurately) generate "what is the probability I get hit by a car if I run across this road now?", you have "numbers you can stand somewhat near" to gauge the risk - or at least "this has happened before" case studies (cf. asteroids). Although you can motivate more longtermist issues via similar means (e.g. "well, we've seen pandemics at least this bad before", "what's the chance folks raising grave concern about an emerging technology prove to be right?"), you typically have less to go on and are reaching further from it.
I think we share similar intuitions: this is a reasonable consideration, but it seems better to account for it quantitatively (e.g. with a sceptical prior or a discount for "distance from solid epistemic ground") rather than with a qualitative heuristic. E.g. it seems reasonable to discount AI risk estimates (potentially by orders of magnitude) if it all seems very outlandish to you - but then you should treat these "all things considered" estimates at face value.
Thanks for the post.
As you note, whether you use exponential or logistic assumptions is essentially decisive for the long-run importance of increments in population growth. Yet we can rule out the exponential assumptions which this proposed "Charlemagne effect" relies upon.
In principle, boundless forward compounding is physically impossible, as there are upper bounds on growth rate from (e.g.) the speed of light, and limitations on density from the amount of available matter in a given volume. This is why logistic functions, not exponential ones, are used for modelling populations in (e.g.) ecology.
Concrete counter-examples to the exponential modelling are thus easy to generate. To give a couple:
A 1% constant annual growth rate assumption would imply that saving one extra survivor 4,000 years ago would have resulted in a current population of ~2 × 10^17: 200 quadrillion people.
A "conservative" 0.00001% annual growth rate still results in populations growing one order of magnitude every ~25 million years. At this rate, you end up with a greater population than atoms in the observable universe within 2 billion years. If you run until the end of the stelliferous era (100 trillion years) at the same rate, you end up with populations on the order of 10^millions, with a population density of basically 10^thousands per cubic millimetre.
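The arithmetic behind both counter-examples, should anyone want to check it:

```r
(1 + 0.01)^4000      # 1% annual growth for 4,000 years: ~1.9e17 descendants per extra survivor
log(10) / 1e-7       # years per order of magnitude at 0.00001% annual growth: ~23 million
1e-7 * 2e9 / log(10) # orders of magnitude gained over 2 billion years: ~87 (cf. ~10^80 atoms in the observable universe)
```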
I don't find said data convincing re. CFAR, for reasons I fear you've heard me rehearse ad nauseam. But this is less relevant: if it were just "CFAR, as an intervention, sucks", I'd figure (and have figured over the last decade) that folks don't need me to make up their own minds. The worst case, if that were true, is wasting some money and a few days of their time.
The doctor case was meant to illustrate that sufficiently consequential screw-ups in an activity can warrant disqualification from doing it again - even if one is candid and contrite about them. I agree activities vary in the prevalence of their "failure intolerable" tasks (medicine and aviation have a lot, creating a movie or a company very few). But most jobs which involve working with others have some things for which failure tolerance is ~zero, and these typically involve safety and safeguarding. For example, a teacher who messes up their lesson plans obviously shouldn't be banned from their profession as a first resort; yet disqualification looks facially appropriate for one who allows their TA to try and abscond with one of their students on a field trip.
CFAR's track record includes a litany of awful mistakes re. welfare and safeguarding, where each taken alone would typically warrant suspension or disqualification, and in concert should guarantee the latter, as it demonstrates - rather than (e.g.) a "grave mistake which is an aberration from their usually excellent standards" - a pattern of gross negligence and utter corporate incompetence. Whatever degree of intermediate risk attending these workshops constitutes is unwise to accept (or to encourage others to accept), given CFAR realising these risks is already well established.
CFAR's mistakes regarding Brent
Although CFAR noted it needed to greatly improve re. "Lack of focus on safety" and "Insufficient Institutional safeguards", evidence these have improved, or whether they are now adequate, remains scant. Noting "we have reformed various things" in an old update is not good enough.
Whether anything would be "good enough" is a fair question. If I, with (mostly) admirable candour, describe a series of grossly incompetent mistakes during my work as a doctor, the appropriate response may still be to disqualify me from future medical practice (there are sidelines re. incentives, but they don't help). The enormity of fucking up as badly as (e.g.[!!!]):
Of the interactions CFAR had with Brent, we consider the decision to let him assist at ESPR - a program we helped run for high school students - to have been particularly unwise. While we were not aware of any allegations of abuse at the time of that decision, many of us did feel that his behavior was sometimes manipulative, and that he was often dismissive of standard ethical norms. We consider it an obvious error to have ignored these behaviors when picking staff for a youth program.
Once the allegations about Brent became public, we notified ESPR students and their parents about them. We do not believe any students were harmed. However, Brent did invite a student (a minor) to leave camp early to join him at Burning Man. Beforehand, Brent had persuaded a CFAR staff member to ask the camp director for permission for Brent to invite the student. Multiple other staff members stepped in to prevent this, by which time the student had decided against attending anyway.
This student does not believe they were harmed. Nevertheless, we consider this invitation to have been a clear violation of common sense ethics. After this incident, CFAR made sure not to invite Brent back to any further youth programs, but we now think it was a mistake not to have gone further and banned Brent from all CFAR events. Additionally, while we believe the staff member's action resulted mostly from Brent's influence causing them not to register the risks, we and they nonetheless agreed that it would be best to part ways, in light both of this incident and a general shared sense of heading in different directions. They left CFAR's employment in November 2018; they will not be in any staff or volunteer roles going forward, but they remain a welcome member of the alumni community.
should be sufficient to disqualify CFAR from running "intensive" residential retreats, especially given the "inner work" and "mutual vulnerability" they (at least used to) have.
I would also hope a healthy EA community would warn its members away from things like this. Regardless, I can do my part: for heaven's sake, just don't go.
Howdy, and belatedly:
0) I am confident I understand; I just think it's wrong. My impression is HIM's activity is less "using reason and evidence to work out what does the most good", and more "using reason and evidence to best reconcile prior career commitments with EA principles".
By analogy, if I were passionate about (e.g.) HIV/AIDS, education, or cancer treatment in LICs, the EA recommendation would not (/should not) be that I presume I maintain this commitment, but rather that I soberly evaluate how interventions within these areas stack up versus all others (with the expectation I would be very unlikely to discover that the best interventions which emerge from this analysis line up with what my passions previously alighted upon). Instead setting up a "GiveWell for education interventions" largely misses the point (and most of the EV "on the table").
So too here. It would be surprising to discover medical careers - typically selected before acquaintance with EA principles - would be optimal or near-optimal by their lights (I'd be surprised if m/any EAs who weren't already doctors thought they were). The face-value analysis is pessimistic on the "is this best" question, notwithstanding (e.g.) there being a lot of variance within a field to optimise: HIV/AIDS interventions vary in effectiveness by orders of magnitude, yet that doesn't make them priorities on the current margin. As, to a first approximation, reality works in first-order terms, we'd want some very good reasons for second-order considerations nonetheless carrying the day: sentiments like "big tent" or "EA is a question" (etc.) can support anything (would they apply to PlayPumps?), so we should attempt to weigh these things up.
Your first point of clarification illustrates the "opacity" I have in mind. "Not necessarily encouraging" folks to apply to medical school implies a lot of epistemic wiggle room: "Should I enter medicine?" and "Should I leave medicine?" are different but closely related questions (consider a 17-year-old applying to medicine versus an 18-year-old first-year student), and answers to the former sense-check answers to the latter. If you really think having impact as a doctor is, for many people, among the best things they can do, this suggests for similar people you would encourage them to enter the profession (this doesn't imply HIM should start doing this, but I think most in EA-land would find this result surprising and worth exploration - not least, it suggests a re-write of the 80k profile). In contrast, if the answer is "even for those initially minded to enter medicine, we'd usually recommend against it as an EA career choice", then there should be a story why this usual recommendation is greatly attenuated (or reversed?) for those already in the profession - particularly at an early stage like medical school. Again, this doesn't govern HIM strategy - but it is informative, and knowing what you yourself think the answer is matters for transparent communication with your audience (even if they find this uncomfortable).
1) Regardless of the semantics of whether one should now call someone like myself a "medic" or not, the substantive issue seems to be whether medicine (generally speaking) is a high-impact activity or not. Suppose (to be clear, I'm not claiming this is the actual story for either of the professions I use as examples) (a):
"High-impact law": where the folks in the profession find their highest-impact options often involve the practice of law in their "day job", or "not strictly legal" roles where their legal training is an important-to-crucial piece of career capital.
Contrast (b):
"High-impact accountancy": where folks in this profession find their highest-impact options very rarely involve the practice of accountancy, and their best career options are typically those where their accounting background is only tangentially relevant (e.g. acquaintance with business operations, a "head for figures").
In the latter case, "high-impact accountancy" looks like an odd term if the real message is to provide accountants with better career options which typically involve leaving the profession. If medicine were like (a), all seems well; but I think it is like (b), thus we disagree.
2) I'd be surprised if most of the folks I mentioned would find several years of medical experience valuable - especially (for the key question of career choice) whether this was a leading opportunity versus alternative ways of spending 10-20% of their working lives. I can ask around, but if you have testimony to hand you're welcome to correct me. I'd guess medical experience is much more relevant for much more medically adjacent (or simply medical) careers - but, per the grandparent, these careers tend to be a lot less impactful in the first place.
3) Our hypothetical Alice may be right about the options you note being "higher impact" than typical practice. Yet effectiveness is multiplier stacking (cf.), so Bob (who doesn't labour under the "having impact as a doctor" constraint) can still expect 10-100x more impact. The latter two examples you give (re. earning to give and working in a LIC) allow direct estimation:
Re. E2G, US and UK doctors are in the top ~5% of their respective populations in earnings. Many other careers plausibly accessible to doctors (e.g. technical start-ups, quant trading, SWE, consulting) have income distributions with either dramatically higher expected earnings, higher median earnings (e.g. friends of mine in some of these fields had higher starting salaries than my expected peak medical salary), or both. This all sets aside that marginal returns to further money, where there is already a lot of aligned money looking for interventions to fund, may be much lower now (cf. "earning to give" careers typically finding themselves a long way down 80k recommendations; forum discourse ad nauseam about "talent constraint", unease about all the lucre sloshing around, etc. etc.).
Re. LIC practice, if we take the 2-3 orders-of-magnitude multiplier at face value (this looks implausible at the upper end) and combine it with ~2 DALYs/year of practice in a high-income country (taking my figures at face value, which are likely too high), you get 2*300 = 600 DALYs. In GiveWell donations, with a conversion of (say) 40 DALYs = one "life saved" (not wildly unreasonable, as the lives saved are typically <5-year-olds), this is ~70,000 dollars/year. This is within reach of E2G doctors (leave alone E2G careers more broadly), and the real number is almost surely lower (probably by an integer factor): the "medical practice" side of the equation is much less rigorous than the GiveWell CEE, and should be anticipated to regress down.
As you say, various constraints (professional or personal) may rule out these other options: perhaps I aim at earning to give, but it happens that medical practice is my most lucrative employment (obviously much more plausible if one is later in one's career); perhaps even if in general the sort of person drawn to medicine can make better contributions outside the profession, this is not true for me in particular. Yet candour seems to oblige foregrounding that such constraints often cut 90%+ of potential impact (and thus the importance of testing whether these constraints are strict).
4) Although comparators are tricky (e.g. if my writing on medical careers was vastly less effective it would be hard to tell), the content of the career plan changes noted in the OP would be more or less reassuring re. what High Impact Medicine is accomplishing. Per the above, as getting the last multipliers is important, HIM's impact is largely determined by the tail of highest-impact plan changes.
Hello Joel,
0) My bad re. the rma.mv output, sorry. I've corrected the offending section. (I'll return to some second-order matters later.)
1) I imagine climbing in Mexico is more pleasant than arguing statistical methods on the internet, so I've tried to save you at least some time on the latter by attempting to replicate your analysis myself.
This attempt was only partially successful: I took the "Lay or Group cleaner" sheet and (per previous comments) flipped the signs where necessary so only Haushofer et al. shows a negative effect. Plugging this into R means I get basically identical results for the forest plot (RE mean 0.50 versus 0.51) and funnel plot (Egger's limit value 0.2671 vs. 0.2670). I get broadly similar but discordant values for the univariate linear and exponential decay models, as well as model 1 in table 2 [henceforth "model 3"] (intercepts and coefficients ~within a standard error of the write-up's figures), and much more discordant values for the others in table 2.
I expect this "failure to fully replicate" is mostly owed to a mix of: i) very small discrepancies between the datasets we are working off, which are likely to be amplified in more complex analyses than in simple forest plots etc.; and ii) the covariates, which I'd guess are much more discrepant, with more degrees of freedom in how they could be incorporated, so it is much more likely we aren't doing exactly the same thing (e.g. "layness" in my sheet seems to be ordinal - values of 0-3 depending on how well trained the provider was - whilst the table suggests it was coded as categorical (trained or not) in the original analysis). Hopefully it is "close enough" for at least some indicative trends not to be operator error. In the spirit of qualified reassurance, here's my funnel plot:
2) Per the above, one of the things I wanted to check is whether indeed you see large drops in effect size when you control for small studies/publication bias/etc. You can't neatly merge (e.g.) Egger into a meta-regression (at least, I can't), but I can add in study standard error as a moderator. Although there would be many misgivings about doing this vs. (e.g.) some transformation (though I expect working harder to linearize etc. would accentuate any effects), there are two benefits: i) it is extremely simple; ii) the intercept value is where SE = 0, and so gives an estimate of what a hypothetical maximally sized study would suggest.
Adding in SE as a moderator reduces the intercept effect size by roughly half (model 1: 0.51 -> 0.25; model 2: 0.42 -> 0.23; model 3: 0.69 -> 0.36). SE inclusion has ~no effect on the exponential model's time-decay coefficient, but does seem to confound the linear decay coefficient (effect size down by a third, so no longer a significant predictor) and the single group-or-individual variable I thought I could helpfully look at (down by ~20%). I take this as suggestive that there is significant confounding of results by small study effects, and a Bayesian best-guess correction is somewhere around a 50% discount.
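For clarity, the kind of fit behind those numbers (a sketch only - I can't promise it matches the exact model specifications, and `followup_years` stands in for whichever moderators each model actually uses):

```r
library(metafor)
dat$sei <- sqrt(dat$vi)  # study standard errors

fit_raw <- rma(yi, vi,                                data = dat)  # unadjusted pooled effect
fit_se  <- rma(yi, vi, mods = ~ sei,                  data = dat)  # pooled effect with SE as moderator
fit_dec <- rma(yi, vi, mods = ~ sei + followup_years, data = dat)  # time-decay model with SE as moderator

fit_raw$b                     # cf. the ~0.51 figure above
predict(fit_se, newmods = 0)  # predicted effect at SE = 0 (the hypothetical maximally sized study)
fit_dec$b                     # time-decay coefficient, now adjusted for small study effects
```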
3) As previously mentioned, if you plug this into the Guesstimate you do not materially change the CEA (roughly 12x to 9x if you halve the effect sizes), but this is because this CEA will return StrongMinds as at least seven times better than cash transfers even if the effect sizes in the MRAs are set to zero. I did wonder how negative the estimate would have to be to change the analysis, but the gears in the Guesstimate include logs, so a lot of it errors if you feed in negative values. I fear, though, that if it were adapted, it would give absurd results (e.g. still recommending StrongMinds even if the MRAs found psychotherapy exacerbated depression more than serious adverse life events).
4) To have an empty file drawer, I also looked at "source" to see whether cost survey studies gave higher effects due to the selection bias noted above. No: non-significantly numerically lower.
5) So it looks like the publication bias is much higher than estimated in the write-up: more like 50% than 15%. I fear part of the reason for this discrepancy is that the approach taken in Table A.2 is likely methodologically and conceptually unsound. I'm not aware of a similar method in the literature, but it sounds like what you did is linearly (?meta)regress g on N for the MetaPsy dataset (at least, I get similar figures when I do this, although my coefficient is 10x larger). If so, this doesn't make a lot of sense to me: SE is non-linear in N (see the illustration at the end of this point), the coefficient doesn't limit appropriately (e.g. an infinitely large study has +inf or -inf effects depending on which side of zero the coefficient is), and you're also extrapolating greatly out of sample for the correction between average study sizes. The largest study in MetaPsy is ~800 (I see two points on my plot above 650), but you are taking the difference of N values at ~630 and ~2700.
Even more importantly, it is very odd to use a third set of studies to make the estimate, versus the two literatures you are evaluating (given an objective is to compare the evidence bases, why not investigate them directly?). Treating them alike also assumes they share the same degree of small study effects - that they are just at different points "along the line" because one tends to have bigger studies than the other. It would seem reasonable to consider that the fields may differ in their susceptibility to publication bias and p-hacking, so that, controlling for N, cash transfer studies are less biased than psychotherapy ones. As we see from the respective forest plots, this is clearly the case - the regression slope for psychotherapy is something like 10x as slope-y as the one for cash transfers.
(As a side note, MetaPsy lets you shove all of their studies into a forest plot, which looks approximately as asymmetric as the one from the present analysis.)
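To illustrate the "SE is non-linear in N" point promised above: for a two-arm SMD with equal arms, the standard error falls off roughly like 2/sqrt(N), so most of the change happens at small N and a straight line fitted to g-on-N around a few hundred participants says little about N ~ 2700 (the g = 0.5 below is just an illustrative effect size, not anything from the analysis):

```r
# Approximate SE of a standardised mean difference with two equal arms of N/2
se_smd <- function(N, g = 0.5) sqrt(4 / N + g^2 / (2 * N))

se_smd(c(100, 630, 800, 2700))
# ~0.20, 0.08, 0.07, 0.04 -- strongly non-linear in N, and flattening out
```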
6) Back to the meta stuff.
I don't suspect either you or HLI of nefarious or deceptive behaviour (besides priors, this is strongly ruled against by publishing data that folks could analyse for themselves). But I do suspect partiality and imperfect intellectual honesty. By loose analogy, rather than a football referee who is (hopefully) unbiased but perhaps error-prone, this is more like the manager of one of the teams claiming "obviously" their side got the rough end of the refereeing decisions (maybe more error-prone in general, definitely more likely to make mistakes favouring one "side", but plausibly/probably sincere), and not like (e.g.) a player cynically diving to try and win a penalty. In other words, I suspect - if anything - you mostly pulled the wool over your own eyes, without really meaning to.
One reason this arises is, unfortunately, that the more I look into things, the more cause for concern I find. Moreover, the direction of concern re. these questionable-to-dubious analysis choices strongly tends towards favouring the intervention. Maybe I see what I want to, but I can't think of many cases where the analysis was surprisingly incurious about a consideration which would likely result in the effect size being adjusted upwards, nor where a concern about accuracy and generalizability could be further allayed with an alternative statistical technique (one minor example of the latter: it looks like you coded Mid and Low therapy as categoricals when testing sensitivity to therapy-ness; if you ordered them I expect you'd get a significant test for trend).
I'm sorry again for mistaking the output you were getting, but - respectfully - it still seems a bit sus. It is not as if one should have had a low index of suspicion for lots of heterogeneity given how permissively you were including studies; although Q is not an oracular test statistic, P < 0.001 should be a prompt to look at this further (especially as you can look at how Q changes when you add in covariates, and a lack of great improvement when you do is a further signal); and presumably the very low R² values mentioned earlier would be another indicator.
Although a meta-analysis as a whole is arduous, knocking up a forest and funnel plot to have a look (e.g. at whether one should indeed use random vs. fixed effects, given one argument for the latter is that they are less sensitive to small study effects) is much easier: I would have no chance of doing any of this statistical assessment without all your work getting the data in the first place; with it, I got the (low-quality, but informative) plots in well under an hour, and doing what you've read above took a morning.
I had the luxury of not being on a deadline, but I'm afraid a remark like "I didn't feel like I had time to put everything in both CEAs, explain it, and finish both CEAs before 2021 ended (which we saw as important for continuing to exist)" inspires sympathy but not reassurance on objectivity. I would guess HLI would have seen not only the quality and timeliness of the CEAs as important to its continued existence, but also the substantive conclusions they reached: "We find the intervention we've discovered is X times better than cash transfers, and credibly better than GiveWell recs" seems much better in that regard than (e.g.) "We find the intervention we previously discovered and recommended now seems inferior to cash transfers - leave alone GiveWell top charities - by the lights of our own further assessment".
Besides being less pleasant, speculating over intentions is much less informative than the actual work itself. I look forward to any further thoughts you have on whether I am on the right track re. correction for small study effects, and I hope future work will show this intervention is indeed as promising as your original analysis suggests.