TIO: A mental health chatbot

This post describes a mental health chatbot, which we call the Talk It Over Chatbot (sometimes TIO for short). The first work on the bot started in 2018, and we have been working on it more earnestly since 2019.

My main aim is to explore whether I should seek funding for it. (It is currently volunteer-led.)

Here’s a summary of my findings:

  • Using direct-results thinking: a cost-effectiveness model indicates that the direct results of this project may be competitive with the benchmark (StrongMinds); however, this is based on a number of assumptions.

  • Using hits-based thinking: this bot is distinctive in having a conversational interface with a programmed bot while also allowing that conversation to be free-flowing. This distinctive approach may allow us to resolve an evidence gap on the effectiveness of certain therapies.

This post invites:

  • Material arguments for or against funding this

  • Interest from funders

  • Interest from volunteers

Our motivations when we started this project

The bot was created by a team of volunteers, the initial three of whom were current/former volunteers with Samaritans, a well-known UK charity which provides emotional support to the distressed and despairing, including those who are suicidal.

A big part of our motivation was the fact that we knew there was a big demand for the Samaritans service, but nowhere near enough supply. This meant that our initial prototype did not need to outperform a human; it only needed to outperform (e.g.) calling the Samaritans and getting the engaged tone.

We also suspected that some users might prefer speaking to software, and this suspicion has been confirmed.

Another motivation was the fact that lots of important decisions we made were under-evidenced or unevidenced. Specifically, when deciding what we, as individual therapeutic listeners, should say to a service user, we were often operating in an evidence vacuum. Our conversations with professionals suggest that this problem is common to other parts of the mental health space, so creating a new source of data and evidence could be highly valuable.

Overview of the cost-effectiveness model

I referred earlier to “direct-results thinking” (as opposed to hits-based thinking). This refers to direct interventions having relatively immediate effects, which can best be assessed with a cost-effectiveness model.

The model that I have built incorporates the expected costs of running the bot and the expected improvements in the wellbeing of the users.

The model tries to envisage a medium-term future state for the bot (rather than, e.g., what the project will look like immediately after funding it).

The decision criterion for the modelled considerations is that the bot’s cost-effectiveness should be competitive with StrongMinds. (Note that the unmodelled considerations may also be material.)

Modelled results of the cost-effectiveness model

  • The realistic assumption-set for the bot does come out with an answer that is (coincidentally) similar to the benchmark (StrongMinds).

  • Optimistic and pessimistic assumption-sets give outputs that are substantially better than / worse than StrongMinds.

The model also explores alternative scenarios:

    • If we negotiate support from Google, then the bot substantially outperforms StrongMinds (by a factor of 6.8). Other ways of sourcing users may have a similar effect.

    • If we use a different assumption for how to convert between different impact measurement scales, then the bot substantially outperforms StrongMinds (by a factor of 4.4).

    • If we use a different assumption for how to convert between interventions of different durations, then the bot outperforms StrongMinds (by a factor of 2.1).

    • If we make different assumptions about the counterfactual, we may conclude that the bot underperforms StrongMinds, potentially materially (although it’s hard to say by how much; see the separate appendix on this topic).

My interpretation of these findings is that the model’s estimates of the bot’s effectiveness are in the same ballpark as StrongMinds, assuming we accept the model’s assumptions.

The model can be found here.

Key assumptions underlying the model

Interested readers are invited to review the appendices, which set out the assumptions in some detail.

A key feature is that the model sets out a medium-term future state for the bot, after some improvements have been made. These assumptions introduce risks; however, for a donor with an appetite for risk and a general willingness to support promising early-stage/startup projects, this need not be an issue.

If I were to pick the one assumption which I think is most likely to materially change the conclusions, it would be the assumption about the counterfactual. This is discussed in its own appendix. This assumption is likely to be favourable to the bot’s cost-effectiveness, and it’s unclear by how much. If you were to fund this project, you would have to do so on one of the following bases:

  • You have comfort about the counterfactuals and therefore you support the development of the bot based on “direct-results thinking” alone

    • This might be because you buy the arguments set out in the appendix about why the counterfactuals are not a worry for the way the bot is currently implemented (via Google ads)...

    • Or because the overall shortage of mental health resources (globally) seems to increase the probability that the project will find new contexts which *do* have favourable counterfactuals.

  • Or you believe that the unmodelled factors are materially favourable, such as the potential to improve the evidence base for certain types of therapy (“hits-based thinking”)

Unmodelled arguments for and against funding the bot

This section refers to factors which aren’t reflected in the model.

For

  • Potential to revolutionise the psychology evidence base?

    • The bot creates a valuable dataset

    • With more scale, the dataset can be used to answer questions like “If a patient/client/user says something like X, what sort of responses are more or less effective at helping them?”

    • There is currently a shortage of data and evidence at this level of granularity

    • It could also provide more evidence for certain types of therapeutic intervention which are currently under-evidenced.

    • This is explored in more detail in Appendix 2

  • The TIO chatbot’s free-text-heavy interface is distinctive and may allow for many other therapeutic approaches in the future

    • The conversational interface allows lots of freedom in terms of what the user can say, like an actual conversation.

    • This is highly unusual compared to other mental health apps, which are generally much more structured than TIO

    • While the bot is currently focused on a specific therapeutic approach (one inspired by the Samaritans listening approach, which has similarities to Rogerian therapy), other approaches may be possible in the future

  • Good feedback loops and low “goodharting” risk

    • Goodharting risk refers to the risk that the immediately available measures are poor representatives of the ultimate impacts that we care about. (Examples include hospitals optimising for shorter waiting times rather than health, or schools optimising for good exam results rather than good education.)

    • This project does not have zero goodharting risk (are we sure that self-reported wellbeing really is the “right” ultimate goal?). However, the goodharting risk is low by comparison with most other charitable projects

    • It also has quick feedback loops, which make it easier to optimise for the things the project is trying to achieve

  • Low-hanging fruit

    • At the moment the project is making good progress given that it is volunteer-run

    • However, there are easy improvements to the bot which could be made with a bit of funding

  • The project has existing tech and experience

    • Unlike a project starting from scratch, this project has experience in creating multiple variants of this bot, and a substantial dataset of c.10,000 conversations

    • The existing team includes almost 40 years of Samaritans listening experience, plus other psychology backgrounds

  • Good connections with volunteers

    • The founding team is well-connected with groups which could supply volunteers to support the work of the project; indeed this project has been wholly volunteer-run thus far

    • A partly volunteer-run project could create cost-effectiveness advantages

Against

  • Questionable evidence base for the underlying therapeutic approach

    • This bot has departed from many other mental health apps by not using CBT, which is commonly used in the mental health app space.

    • Instead it’s based on the approach used by Samaritans. While Samaritans is well-established, the evidence base for the Samaritans approach is not strong, and is substantially less strong than that for CBT

    • Part of my motivation was to improve the evidence base, and having seen the results thus far, I have more faith in the bot’s approach, although more work to strengthen the evidence base would be valuable

  • Counterfactuals: would someone else have created a similar app?

    • While the approach taken by *this* chatbot is distinctive, the space of mental health apps is quite busy.

    • Perhaps someone else would (eventually) have come up with a similar idea

    • I believe that our team can be better than the average team who would work on this, because we are naturals at being evidence-based. However, I’m biased.

  • Language may put a limit on scalability

    • Currently, if someone wants to build a mechanism for the bot to detect what the user has said, they need (a) a careful understanding of how to respond empathetically and (b) a nuanced understanding of the language being spoken

    • Currently, making this bot work in a different language would not be a simple task

    • This is important, as mental health needs in the global south are substantial and neglected

    • Note that further investments in AI may invalidate this concern (not fully investigated)

  • Research goals may be ambitious

    • The need for more evidence on certain types of therapeutic intervention is well-established. However, it’s unclear how useful it is to extrapolate evidence from a bot and apply it to a human-to-human intervention

    • Even if this evidence is (at least partly) relevant, we don’t yet know whether the scientific community is ready to accept such evidence

    • Realistically, I don’t expect we will see hard results (e.g. published research papers) for some years, and outcomes that occur a long time after the funding is provided carry more risk

How would extra funds be used?

At first, the most pressing needs are:

  1. A specialist with NLP (Natural Language Processing) skills

  2. More frontend skills, especially design and UX

  3. A strong frontend developer, preferably with full-stack skills

  4. A rigorous evaluation of the bot

  5. More Google ads

The purpose of this post is to gather feedback on whether the project has sufficient potential to warrant funding it (which is a high-level question), so for brevity this section does not set out a detailed budget.

Volunteering opportunities

We would currently benefit from the following types of volunteer:

  • Front-end graphic design and UX volunteer

    • Skills needed: experience in graphic design and UX

    • Could be split into two roles

  • Front-end developer

    • Skills needed: confidence with JavaScript, HTML, CSS

    • Familiarity with how the front end interacts with the backend via the Python Flask framework would be a bonus

  • Back-end developer

    • Skills needed: familiarity with Python

    • NLP or AI knowledge desirable

  • Psychology advisor

    • Required background: professional, qualified psychologist or psychiatrist, preferably with plenty of past/ongoing clinical experience, ideally with a Rogerian background

  • Psychology research advisor

    • Required background: academic experience assessing psychological interventions, preferably including hard-to-assess interventions such as Rogerian or existential therapies

Our existing team and/or the people we are talking to already cover all of these areas to a certain degree; however, further help is needed on all of these fronts.

Appendices

The appendices are in the following sections:

  1. More info about how the bot works

  2. How this bot might revolutionise the evidence base for some therapies

  3. History of the bot

  4. More details of the cost-effectiveness modelling

  5. Overview of other mental health apps

Appendix 1a: How the bot works

The bot sources its users via Google ads, from people who are searching for things like “I’m feeling depressed”.

  • The bot starts by asking users how they feel on a scale from 1 to 10 (see the later appendix about impact measures for more on how this question is worded)

  • The bot starts to initiate the conversation (to give users a little feel for what to expect)

  • The bot explains some things: e.g. it makes sure that the user knows that they are talking to a bot, not a human; gives them a choice about whether we can use their conversation data; and asks the user to say how they feel at the end too

  • The bot enters the main (or free text) section, where the user explores their feelings. In the original prototype (or “MVP”) of the bot, the responses were all simple generic responses saying something like “I hear you”, “I’m listening”, etc. This simple bot generated a net positive impact on users. The bot is now much more sophisticated: several more tailored responses are programmed in. Where a tailored response isn’t found, the fallback is to use a generic response.

  • The user clicks on a button (or something slightly different depending on the front end) to indicate that the conversation is at an end, says how they’re feeling at the end, and has an opportunity to give feedback
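To make the flow above concrete, here is a minimal sketch of the conversation loop in Python. It is illustrative only: the names used (run_conversation, choose_response, and so on) are hypothetical and are not taken from the actual TIO codebase, which runs as a Flask web app rather than a command-line loop.

```python
# Illustrative sketch of the TIO conversation flow described above.
# All names are hypothetical; input validation is omitted for brevity.

def choose_response(message):
    # Placeholder: the response-selection logic is sketched in Appendix 1b.
    return "I hear you. Tell me more about how that feels."

def run_conversation(ask, say):
    """ask: function that prompts the user and returns their reply.
       say: function that displays a bot message."""
    say("Please rate how you feel on a scale from 1 to 10, "
        "where 1 is terrible and 10 is great.")
    initial_score = int(ask("> "))

    say("You are talking to a simple bot, not a human. "
        "You can choose whether we may use your conversation data.")

    # Main free-text section: keep responding until the user ends the chat.
    while True:
        user_message = ask("> ")
        if user_message.strip().lower() == "end":   # stands in for the end-of-chat button
            break
        say(choose_response(user_message))          # tailored response, or generic fallback

    say(f"Please rate how you feel on a scale from 1 to 10. "
        f"As a reminder, the score you gave at the start was {initial_score}.")
    final_score = int(ask("> "))
    return final_score - initial_score              # the impact score used in this post
```

A command-line version could be run as run_conversation(input, print); the real bot delivers the same steps through a web front end.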

The philosophy behind how the bot chooses its responses can be summarised as follows:

  • The bot is humble: it doesn’t claim to know more about the user’s life than the user does

  • The bot is non-judgemental

  • The bot is honest: e.g. the bot not only tells users that they are talking to a bot, but users have to click on something which says “I will talk to you even though you are a simple bot” to make sure there is no ambiguity

  • The bot respects users’ wishes around data privacy: the bot spends some time carefully explaining the choices the user has (i.e. totally confidential, where nobody ever sees what they wrote, or anonymous, where our “boffins” can see the user’s words but we don’t know who they are)

  • The bot gives empathetic responses

  • The bot doesn’t try to diminish what the user says (e.g. by telling them that they should just cheer up, or that everything is all OK really)

  • The bot aims to achieve change in the user’s emotional state by letting the user express what’s on their mind

  • Allowing the user to express what’s on their mind includes allowing them to express things that they cannot express to people in their lives (such as their suicidal thoughts). The bot may explain that (e.g.) it is only a simple bot and therefore unable to call an ambulance for them, but it won’t get shocked or scared by users talking about suicide.

  • The bot does not guide the user to change the world around them (e.g. by providing the user with advice on what they should do). Instead the bot provides a space where the user can be listened to. If this leads to the user feeling emotionally better, and then choosing to make their own decisions about what they should do next, then so much the better.

If you want to get a feel for what the bot is like, here are some recordings of the bot:

TIO example conversation using blue/grey frontend

TIO example conversation using bootstrap frontend

TIO example conversation using Guided Track frontend

You are also welcome to interact with the chatbot; however, note that the bot’s responses are geared towards *actual* users, so people who aren’t in emotional need at the time of using the bot are likely to have a more disappointing experience. Here’s a link. Note that this is a test version of the bot, because I want to avoid contaminating our data with testing users. However, if you are genuinely feeling low (i.e. you are a “genuine” user of the service) then you are welcome to use the bot here.

Appendix 1b: Does the bot use AI?

It depends on what you mean by AI.

The bot doesn’t use machine learning, and we have a certain nervousness about incorporating a “black box” with poor explainability.

Instead we are currently using a form of pattern matching, where the bot looks for certain strings of text in the user’s message (this description somewhat simplifies the reality).
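As a rough illustration of what this kind of pattern matching can look like, here is a minimal sketch. The patterns and responses shown are invented for illustration; the real bot’s rules are more numerous and more carefully worded.

```python
import re
import random

# Illustrative (pattern, responses) pairs -- invented examples, not the real rule set.
TAILORED_RULES = [
    (re.compile(r"\b(lonely|alone)\b", re.IGNORECASE),
     ["That sounds really isolating. What does the loneliness feel like for you?"]),
    (re.compile(r"\b(work|job|boss)\b", re.IGNORECASE),
     ["It sounds like work is weighing on you. Would you like to say more about that?"]),
]

# Generic fallbacks, in the spirit of the original MVP responses.
GENERIC_RESPONSES = ["I hear you.", "I'm listening.", "Tell me more about that."]

def choose_response(message: str) -> str:
    """Return a tailored response if a pattern matches, otherwise a generic one."""
    for pattern, responses in TAILORED_RULES:
        if pattern.search(message):
            return random.choice(responses)
    return random.choice(GENERIC_RESPONSES)
```

A library such as spaCy or NLTK could let the matching step work on lemmas or sentence similarity rather than raw strings, which is the direction described below.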

At the time of writing we are investigating “Natural Language Processing” (NLP) libraries, such as NLTK and spaCy. We currently expect to ramp up the amount of AI we use.

Under some definitions of AI, even relatively simple chatbots count as AI, so this chatbot would certainly count as AI under those definitions.

Appendix 1c: Summary of impact scores

The impact scores shown here are calculated as:

How the user was feeling on a scale from 1 to 10 at the end of the conversation

Minus

How the user was feeling on a scale from 1 to 10 at the start of the conversation

The total number of users is >10,000, but not all users state how they feel at the end, so the number of datapoints (n) feeding into the above table is only c.3,300.

  • The first two rows relate to time periods when the team was smaller, so we acted more slowly. Hence the larger sample sizes.

  • The bot has three front ends. The blue section shows the scores relating to the overall picture (i.e. a combination of all three front ends). The pale red section shows the best-performing front end.

The story apparently being told in the above table is that each successive change improved the bot, except that the last change (improving the design) was a step backwards. However, we performed some statistical analysis, and we can’t be confident that this story is correct.

  • What we can say with confidence: the bot does better than a null hypothesis of zero impact (p-values consistently below 1%, and often many orders of magnitude below 1%, for a t-test)

  • What we can’t say with confidence: we don’t know whether the bot version on one row really is better or worse than the previous row; p-values on a two-sample t-test comparing one row of the above table with the previous row are consistently disappointing (normally >10%).

The main thing holding us back is money: getting larger sample sizes involves paying for more Google ads; if we could do this we would be able to draw inferences more effectively.
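For readers who want to see what those two tests look like in practice, here is a minimal sketch using scipy. The arrays of impact scores below are made-up placeholders; the real analysis runs on the recorded start/end scores for each bot version.

```python
import numpy as np
from scipy import stats

# Placeholder impact scores (final score minus initial score) for two bot versions.
# In reality these come from the recorded conversations, not from random data.
rng = np.random.default_rng(0)
version_a = rng.normal(loc=0.6, scale=2.0, size=800)   # older version
version_b = rng.normal(loc=0.9, scale=2.0, size=400)   # newer version

# Test 1: is each version better than a null hypothesis of zero impact?
t_a, p_a = stats.ttest_1samp(version_a, popmean=0.0)
t_b, p_b = stats.ttest_1samp(version_b, popmean=0.0)

# Test 2: is the newer version detectably different from the older one?
t_ab, p_ab = stats.ttest_ind(version_a, version_b, equal_var=False)

print(f"Version A vs zero: p = {p_a:.3g}")
print(f"Version B vs zero: p = {p_b:.3g}")
print(f"A vs B (two-sample): p = {p_ab:.3g}")
```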

Appendix 1d: How good are these impact scores?

The changes in happiness scores in the above table are mostly in the range 0.3 out of 10 (for the earliest prototype) to 0.99 out of 10.

Is this big or small?

If we look at this EA Forum post by Michael Plant, it seems, at first glance, that the bot’s impacts are moderately big. The table in that post suggests that achieving changes as large as 1 point or more on a scale from 1 to 10 is difficult: it shows that a physical illness has an impact of 0.22 out of 10, and having depression has an impact of 0.72 out of 10. However, note that the question wording was different, so these figures aren’t necessarily on a like-for-like basis.

(Thank you to Michael for directing me to this in the call I had with him on the topic.)

Taking a more statistical approach to the question, the Cohen’s d typically falls in the range 0.2 to 0.5. Cohen’s d is a measure which compares the size of the change with the standard deviation. A Cohen’s d of 0.2 typically indicates a modest effect size, 0.5 indicates a medium effect size, and 0.8 indicates a large effect size.
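As a concrete illustration of the arithmetic (with made-up numbers, not the bot’s actual data): for pre/post scores, one common convention is to divide the mean change by the standard deviation of the change scores.

```python
import numpy as np

# Made-up impact scores (final minus initial), purely to illustrate the calculation.
impact_scores = np.array([0, 0, 1, 2, -1, 3, 0, 1, 2, 0, 4, -2, 1, 0, 2])

mean_change = impact_scores.mean()
sd_change = impact_scores.std(ddof=1)      # sample standard deviation of the changes
cohens_d = mean_change / sd_change

print(f"mean change = {mean_change:.2f}, sd = {sd_change:.2f}, d = {cohens_d:.2f}")
```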

Appendix 1e: Distribution of responses to the bot

Users’ responses to the bot experience are varied.

The below is an example histogram showing the users’ impact scores (i.e. final score minus initial score) for one variant of the bot.

If we ignore the big dollop of users whose impact score is zero out of 10, the shape seems to indicate a roughly symmetrical distribution centred around 1 out of 10.

Note that the big dollop of zero scores may be linked to the fact that the question wording “leads” or “anchors” the user towards a zero score (see the separate appendix about the impact measures for more on this).

There are also a few extreme values, with more positive extreme values than negative ones.

I have read all of the conversations which have happened between users and the bot (that the users have been willing to share). I synthesise here my impressions based on those conversations and the feedback provided.

Extremely positive reactions:

  • The bot allows users to express things that they wouldn’t feel able to say to any human

  • The bot creates a sense of “safety” which is absent in contexts with more social pressure, such as interacting with another human

  • It appears that knowing *exactly* what happens with the data is probably important

  • The “on-demand” nature of the bot seemed relevant (this is more my inference than something said explicitly)

  • Users sometimes express surprise that the conversation with the bot is as therapeutic as it is

  • A surprisingly large number of users described the bot with superlative terms such as “it’s perfect” and “don’t change anything about it!”

  • The most positive reactions often seemed to involve some element of a (conscious) Eliza effect

Neutral to mildly positive reactions:

  • These seemed to include some features of the more positive and the more negative reactions

  • Some comments noted that the bot is imperfect, but better than nothing (mirroring our thinking when we set up the project)

  • Users often asked for more sophistication in the bot’s responses (especially earlier on, when the bot’s responses were extremely unsophisticated)

Negative reactions:

  • Some reactions were negative about the fact that the user was talking to a bot, arguing that they could never get the support they need without speaking to a human

  • Relatedly, some users felt that talking to a bot just highlighted how lonely they were

  • Others thought that the bot was too simplistic

  • Some users had responses that were familiar from the Samaritans context, including disappointment at not receiving advice, and the idea that expressing your feelings could actually be unhelpful

Appendix 1f: Impact measures: the “happiness” question

To assess our impact, we ask the user how they feel at the start and at the end of the conversation.

At the start of the conversation, the bot asks the user:

“Please rate how you feel on a scale from 1 to 10, where 1 is terrible and 10 is great”

When the user initiates the final feedback question, they see this text:

“Thank you for using this bot. Please rate how you feel on a scale from 1 to 10, where 1 is terrible and 10 is great. As a reminder, the score you gave at the start was <INITIAL_SCORE>”

Considerations:

  • Is the anchoring effect desirable?

    • The wording in the final question reminds the user which score they gave at the start

    • I suspect this is probably good: it means that if the user really is feeling better/worse than they felt before, then they can calibrate their answer accordingly

    • However, you could argue the contrary: maybe a purer and better answer would be given without the anchoring?

  • Consistency with other research

    • As my years of experience in research have taught me, people can be surprisingly responsive to the smallest details of how a question is worded

    • For comparability across different research, it’s preferable to use consistent wording

    • An early prototype tried using the same wording used by Samaritans in their research. However, because the Samaritans research was delivered verbally as part of a phone call, there wasn’t a single, canonical wording; instead I tried using wording that followed the same approach. Recent Samaritans studies have used a distress scale where distress was rated on a scale of 0-10 (0 = no distress, 5 = moderate distress and 10 = extreme distress)

    • When I implemented this, I found that surprisingly low levels of distress were often reported, which seemed to indicate that users were getting confused and thinking that 10 out of 10 is good and 0 out of 10 is bad.

    • Hence the bot now uses a different wording.

    • I didn’t find a wording used by another group which works this way round (and is only one question long)

  • Consistency with what we’ve done before

    • Now that we have had over 10,000 users, if we were to change the wording of our question, it would be a significant departure from what we’ve done before, and would have implications for comparability.

Appendix 2: How this bot might revolutionise the evidence base

The source for much of this section is conversations with existing professional psychiatrists/psychologists.

Currently some psychological interventions are substantially better evidenced than others.

  • E.g. CBT (Cognitive Behavioural Therapy) has a really strong evidence base

  • Other therapeutic approaches such as Rogerian or existential therapies don’t have a strong evidence base

Most psychologists would not argue that we should therefore dismiss these poorly evidenced approaches.

Instead, the shortage of evidence reflects the fact that this sort of therapy typically doesn’t have a clearly defined “playbook” of what the therapist says.

Part of the aim of this project is to address this in two ways:

  1. Providing a uniform intervention that can be assessed at scale

    1. Example: imagine a totally uniform intervention, such as giving patients a Prozac pill, which is totally consistent from one pill to the next

    2. A totally uniform intervention (like Prozac) is easier to assess

    3. Some therapeutic approaches (like CBT) are closer to being uniform (although, depending on how you implement them, CBT can be more or less uniform)

    4. Others, like Rogerian or existential therapies, are highly non-uniform: they don’t have a clear “playbook”

    5. This means that we don’t have good evidence around the effectiveness of these sorts of therapy

    6. This chatbot project aims to implement a form of Rogerian-esque therapy that *is* uniform, allowing assessment at scale

  2. Allowing an experimental/scientific approach which could provide an evidence base for therapists

    1. At the moment, there is a shortage of evidence about *what specifically* to say in a given context

    2. To take an example which would be controversial within Samaritans:

    3. Imagine that a service user is talking about some health issues. Consider the following possible responses that could be said by a therapist/listening volunteer:

      1. “You must recognise that these health issues might be serious. Please make sure you go to see a doctor. Will you do that for me please?”

      2. “How would you feel about seeing a doctor?”

      3. “How are you feeling about the health issues you’ve described?”

    4. In a Samaritans context, the first of these would most likely be considered overly directive, the third would likely be considered “safe” (at least with regard to the risk of being overly directive), and the middle option would be controversial.

    5. Currently, these debates operate in an evidence vacuum

    6. Adding evidence need not fully resolve the debate, but it would almost certainly take the debate forward.

I believe that this benefit on its own may be sufficient to justify funding the project (i.e. even if there were no short-term benefit). This reflects the fact that other, better-evidenced interventions don’t work for everyone (e.g. CBT might be great for person A, but ineffective for person B). CBT therefore can’t solve the mental health problem on its own, so we need more evidence on more types of therapy.

Crucially, TIO is fundamentally different from other mental health apps: it has a free-form conversational interface, similar to an actual conversation (unlike other apps, which either don’t have any conversational interface at all, or have a fairly restricted/“guided” conversational capability). This means that TIO is uniquely well-positioned to achieve this goal.

Note that there is still plenty of work needed before the bot can make this sort of contribution to the evidence base. Specifically, a number of other mental health apps have been subjected to rigorous evaluations published in journals, whereas TIO, still an early-stage project, has not.

Appendix 3: History of the bot

2018: I created the first prototype, which was called the Feelings Feline. The bot took on the persona of a cat. This was partly to explain to the user why the bot’s responses were simplistic, and also because of anecdotal evidence of the therapeutic value of conversations with pets. User feedback suggested that, with the incredibly basic design used, this didn’t work. I suspect that with more care this could be a success; however, it would need better design expertise than I had at the time (which was almost none).

Feedback was slow at this stage. Because of a lack of technical expertise, there was no mechanism for gathering feedback (although this is straightforward to develop for any full-stack web developer). However, another glitch in the process did (accidentally) lead to feedback. The glitch arose from an unforeseen feature of Wix websites. Wix is an online service that allows people to create websites, and I had created a Wix website to explain the concept of the bot; it then pointed to another website which I had programmed in JavaScript. However, Wix automatically includes a chat interface in its websites (which doesn’t show up in the preview, so I didn’t know it existed when I launched the site). This led to confusion: many users started talking to a chat interface which I hadn’t programmed and didn’t even know existed! A number of messages came through to my inbox. I responded to them, as I felt I had to, because many of them expressed great distress. In the course of those conversations I explained about the chatbot (again, I had to, to clear up the confusion). Although this was not the intention, it meant that some users gave me their thoughts about the website, and some of them did so in a fairly high-quality “qualitative research” way. Once I worked out what was going on, I cleared up the website confusion, and the high-quality feedback stopped.

2019: Another EA (Peter Brietbart) was working on another mental health app, and he introduced me to Guided Track. Huge thanks to Peter for doing this; it was incredibly useful, and I can’t overstate how helpful he was. Guided Track provided a solution to the problems around gathering feedback, all through a clean, professional-looking front end, together with an incredibly easy-to-program back end.

By this stage, the team consisted of me and two other people whom I had recruited through my network of Samaritans volunteers.

2020: The team has now grown further, including some non-Samaritan volunteers. My attempts to recruit volunteers who are expert in UX (user experience) and design have not been successful. We also have a stronger network of people with a psychology background, and have surpassed 10,000 users.

Appendix 4a: Detailed description of assumptions in the cost-effectiveness model

This section runs through each of the assumptions in the cost-effectiveness model.

  • Fixed costs per annum: Based on having seen the budgets of some early-stage tech startups, the numbers used in the model seem like a reasonable budget for something moderately early-stage with relatively uncomplicated tech. Actually, the tech may be complicated, but there’s the offsetting advantage that lots of the work has already been done. Note also that this is intended as a medium-term assumption, rather than a budget for the first year.

  • Variable costs, per user: The optimistic budget of $0.14 comes from actual past experience (we have had periods of spending £0.11 per user, although currently we are relying on free ads from a charity). Cost per click (CPC) seems unlikely to get much higher than $2 (this is also the upper bound that is enforced under Google for Nonprofits). The model also assumes that costs won’t increase materially from operating at scale, because of high unmet demand for help from those feeling low. The costs should be kept down somewhat by sourcing some of the ads via a charity entity.

  • Number of users per annum: Reaching the pessimistic figure of 50,000 users via Google ads does not seem difficult:

    • The bot is currently getting c.60 users per day (i.e. c.20,000 per annum) using just free ads via Google for Nonprofits

    • Google for Nonprofits is highly restrictive, including a number of arbitrary restrictions. If we supplement this with paid ads, we should expect many more users

    • The bot was receiving almost as many users previously when we used paid ads at a rate of £3 per day, limited to the UK only, and we still had plenty of room for more funding.

    • Expanding geographically to other English-speaking geographies will allow much more scale

    • My intuition that we can reach many more people comes from the impression that mental health issues are widespread, that people feel able to enter search queries into Google when they might not be willing to ask their closest friends, and that this trend is growing. Hence an optimistic assumption which is substantially larger (50 million), which doesn’t seem unreasonable given that 450 million people around the world suffer from mental health issues (source: WHO). I chose a realistic assumption equal to the geometric mean of the pessimistic and optimistic figures.

    • However, in case the assumptions about scale were overoptimistic, I tried tweaking the realistic assumption so that the number of users was substantially lower (only 200,000), and it didn’t change the conclusion at the bottom of the table.

  • Total cost, per user: Calculated within the model from the assumptions above.

  • Benchmark improvement in “happiness” on a 27-point scale (PHQ-9): Several past StrongMinds reports (e.g. this one) show an average improvement in PHQ-9 of 13 points (see the bottom of page 2)

  • Benchmark duration of “happiness” improvement (years): StrongMinds reports indicate that they test again after 6 months. If some people have been checked after 6 months, how long should we assume that the improvement lasts? A reasonable assumption is that it will last for another 6 months after that, so 1 year in total.

  • Benchmark cost per patient: StrongMinds tracks their cost per patient, and for the most recent report pre-COVID this came to $153 for the StrongMinds-led groups, and $66 for the peer-led groups (i.e. the groups where a past graduate of their groups runs a new group, rather than a StrongMinds staff member). We then divided this by 75%, because we want the cost per patient actually cured of depression, and 75% is the StrongMinds target (giving roughly $204 and $88 respectively).

  • TIO: Improvement in self-reported “happiness” on a 1-10 scale: The pessimistic score of 0.9 comes from assuming that we simply take the current bot, focus on the best-performing front end, perhaps back out the recent design changes if necessary, and then we have a bot which seems to be generating 0.99 (see the scores noted earlier). If, in addition, we give credit to future targeting of users and future improvements in the bot, we should expect the impact score to be higher. The realistic assumption of 2.8 comes from taking the data from the current bot and taking the average of the positive scores (i.e. assume that we can effectively target users who will benefit, but assume no further improvements in the bot’s effectiveness). To choose an optimistic figure, I calculated the number such that 2.8 would be the geometric mean; this came to 8.7, which I thought was too high, so I nudged it down to 7.

  • TIO: Duration of improvement (days) for a single usage of the bot: We know that some people feel better after using the bot, but we don’t know how long this will persist. In the absence of evidence, this is really a matter of guesswork, all the more so since my impression from having read the conversations is that the extent to which the effect persists probably varies. The assumptions chosen here were between a few hours (i.e. a quarter of a day) and a week. The realistic assumption is the geometric mean of the optimistic and pessimistic figures.

  • TIO: Proportion of “reusers”, i.e. those who reuse the bot: Thus far, there have already been some instances of users indicating that they have liked the experience of using the bot and will bookmark the site and come back and use it again, and we have currently made no effort to encourage this. It therefore seems reasonable to believe that if we tried to encourage this more, it would happen more, and this would improve the cost-effectiveness (because we wouldn’t have to pay for more Google ads, and the Google ads are a material part of the cost base under the medium-term scenario). However, we also need to manage a careful balance between three things: (a) reuse is potentially good for a user and good for the project’s cost-effectiveness; (b) dependency risk; and (c) user privacy. Dependency here refers to the risk (for which there is precedent) that users of such services can develop an unhealthy relationship with the service, almost akin to an addiction. We have not yet thought through how to manage this.

  • TIO: Assumed duration of continuing to use the bot for reusers (months): On this point we are operating in the absence of data, so the duration has been estimated as anywhere between 2 weeks (pessimistic) and 1 year (optimistic; although there is no in-principle reason why the benefits could not persist for longer). The central assumption of 3 months was chosen as a rounded geometric mean of the optimistic and pessimistic scenarios.

  • TIO: Number of happiness-point-years: Calculated as the proportion of users who are single users × the duration of impact for a single usage, plus the proportion of users who are reusers × the duration of continuing to use the bot (see the sketch after this list). This includes the assumption that if someone is a multiple user, then the happiness improvement persists throughout the duration of their reuse period. This is a generous assumption (although no more generous than the assumption being made for the benchmark). On the other hand, the reusers are assumed to have the same happiness improvement as single users, which is a very harsh assumption, given that users have varied reactions to the bot, and those who are least keen on the bot seem less likely to reuse it.

  • TIO: Cost per happiness-point-year: Calculated within the model from the rows above.

  • Decision criterion: Should we fund TIO? In the row in yellow labelled “Should we fund TIO?” there is a formula which determines whether the answer is “Yes”, “No”, or “Tentative yes”. Tentative yes is chosen if the two cost-effectiveness figures are close to each other, which is defined (somewhat arbitrarily) as being within a factor of 2 of each other. This is to reflect the fact that if the cost-effectiveness comes out at a number close to the benchmark, then we should be nervous, because the model depends on a number of assumptions.

(Note that the comparison here is against the standard StrongMinds intervention, not the StrongMinds chatbot.)
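As referenced above, here is a minimal sketch of how these rows fit together, written in Python. All of the input numbers below are placeholders chosen for illustration; they are not the values in the actual spreadsheet, and the exact formulas in the model may differ in detail.

```python
# Illustrative sketch of the cost-effectiveness calculation described above.
# Every number here is a placeholder, not a value from the actual model.

def tio_cost_per_happiness_point_year(
    fixed_costs_per_annum,      # e.g. staff and infrastructure
    variable_cost_per_user,     # e.g. Google ads cost per user
    users_per_annum,
    happiness_improvement,      # points gained on the 1-10 scale
    single_use_duration_days,   # how long the improvement lasts for a one-off user
    reuser_proportion,          # share of users who keep coming back
    reuse_duration_months,      # how long reusers keep benefiting
):
    total_cost_per_user = fixed_costs_per_annum / users_per_annum + variable_cost_per_user

    # Happiness-point-years per user: single users keep the improvement for a few days,
    # reusers are assumed to keep it for the whole reuse period.
    single_share = 1 - reuser_proportion
    point_years_per_user = happiness_improvement * (
        single_share * single_use_duration_days / 365
        + reuser_proportion * reuse_duration_months / 12
    )
    return total_cost_per_user / point_years_per_user

# Placeholder inputs, purely to show the mechanics.
tio = tio_cost_per_happiness_point_year(
    fixed_costs_per_annum=150_000,
    variable_cost_per_user=0.50,
    users_per_annum=1_000_000,
    happiness_improvement=2.8,
    single_use_duration_days=1.3,
    reuser_proportion=0.1,
    reuse_duration_months=3,
)

benchmark = 60.0   # placeholder benchmark cost per happiness-point-year

# One plausible reading of the decision rule: "Tentative yes" if within a factor of 2.
if tio < benchmark and benchmark / tio >= 2:
    decision = "Yes"
elif max(tio, benchmark) / min(tio, benchmark) < 2:
    decision = "Tentative yes"
else:
    decision = "No"

print(f"TIO: ${tio:.2f} per happiness-point-year -> {decision}")
```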

Optimistic and pessimistic assumptions:

I did not try to carefully calibrate the probabilities around the optimistic and pessimistic scenarios; however, they are probably something like a 75%-90% confidence interval for each assumption.

Note further that most of the assumptions are essentially independent of each other, meaning that an overall scenario as extreme as the optimistic or pessimistic scenario (in which every assumption takes its extreme value at once) is considerably more extreme than a 90% confidence interval.

Having said that, the main aim of including those scenarios was to highlight the fact that the model is based on a number of uncertain assumptions about the future, and specifically that the uncertainty is sufficient to make the difference between being close to the benchmark and being really quite far from the benchmark (in either direction).

Appendix 4b: PHQ-9 to 1-10 conversion

This bot measures its effectiveness based on a scale from 1 to 10. The comparator (StrongMinds) uses a well-established standard metric, which is PHQ-9. PHQ-9 is described in more detail in a separate appendix.

The calculation used in the base scenario of the model is to take the 13-point movement in the 27-point PHQ-9 scale and simply rescale it.
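For concreteness, here is one plausible reading of that rescaling. I am assuming a straight proportional mapping of the 27-point PHQ-9 range onto the 9-point range of a 1-10 scale; the exact convention used in the spreadsheet may differ.

```python
# One plausible reading of "simply rescale it"; the model's exact convention may differ.
PHQ9_RANGE = 27          # PHQ-9 scores run from 0 to 27
SCALE_1_TO_10_RANGE = 9  # a 1-10 scale spans 9 points

phq9_movement = 13       # StrongMinds' reported average PHQ-9 improvement
rescaled = phq9_movement * SCALE_1_TO_10_RANGE / PHQ9_RANGE
print(f"{phq9_movement} PHQ-9 points ~ {rescaled:.2f} points on a 1-10 scale")
# -> roughly 4.3 points on the 1-10 scale
```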

This is overly simplistic. As can be seen from the separate appendix about PHQ-9, PHQ-9 assesses several very different things, and is constructed in a different way, so a straight rescaling isn’t likely to constitute an effective translation between the two measures.

To take another data point, this post by Michael Plant of the Happier Lives Institute suggests that the effect of curing someone’s depression appears to be 0.72 on a 1-10 scale, using data that come from Clark et al. The chatbot cost-effectiveness model includes an alternative scenario which uses this conversion between the different impact scales, and it suggests that the chatbot (TIO) outperforms the benchmark by quite a substantial margin under this assumption.

To a certain extent this may seem worrying: the output figures move quite considerably in response to an uncertain assumption. However, I believe I have used a fairly conservative assumption in the base scenario, which may give us some comfort.

Appendix 4c: Time conversion

The cost-effectiveness model compares interventions with different durations.

  • Let’s assume that a hypothetical short-term intervention improves user wellbeing for 1 day (say).

  • Let’s assume that a hypothetical long-term intervention improves user wellbeing for one year (say).

  • And assume that the extent of the wellbeing improvement is the same.

  • Should we say that the intervention which lasts 365 times as long is 365 times more valuable? (Let’s call this the time-smoothing assumption.)

I raised a question about this on the Effective Altruism, Mental Health, and Happiness Facebook group. (I suspect you may need to be a member of the group to read it.) It generated some discussion, but no conclusive answers.

In the cost-effectiveness model I’ve used what I call the time-smoothing assumption, as illustrated below.
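In other words, the model values an improvement in proportion to how long it lasts. A tiny worked example (with made-up numbers):

```python
# Time-smoothing assumption: value = size of improvement x duration (in years).
improvement = 1.0                        # 1 point on the 1-10 scale, in both cases

short_term = improvement * (1 / 365)     # lasts 1 day  -> ~0.0027 happiness-point-years
long_term = improvement * 1.0            # lasts 1 year -> 1.0 happiness-point-years

print(long_term / short_term)            # -> 365.0: the year-long effect counts 365x as much
```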

It’s possible that the time-smoothing assumption is too harsh on the bot (or, equivalently, too generous to the longer-term intervention). After all, if someone’s depression is improved after the StrongMinds intervention, it seems unrealistic to believe that the patient’s wellbeing will be consistently, uninterruptedly improved for a year. That said, it also seems unrealistic to believe that the chatbot will *always* cause the user to feel *consistently* better for a full day (or whichever assumption we are using); however, in a short time period there is less scope for variance.

The cost-effectiveness model includes an alternative scenario which explores this by applying a somewhat arbitrarily chosen factor of 2, to reflect the possibility that the time-smoothing assumption used in the base case is too harsh on TIO.

However, my overall conclusion on this section is that the topic is difficult, and remains, to my mind, an area of uncertainty.

Appendix 4d: The “support from Google” scenario

The cost-effectiveness model includes a “Support from Google” scenario.

At the last general meeting of the chatbot team, it was suggested that Google may be willing to support a project such as this, and specifically that this sort of project would appeal to Google sufficiently that they might be willing to support the project above and beyond their existing Google for Nonprofits programme. I have not done any work to verify whether this is realistic.

There is a scenario in the cost-effectiveness model for the project receiving this support from Google, which involves setting the Google ads costs to zero.

I would only consider this legitimate if there were sufficiently favourable counterfactuals.

  • If the Google support came from the same source as Google’s donations to GiveDirectly, for example, then I’d worry that the counterfactual use of that money might produce quite good outcomes, and I should still treat the money as a cost within the model

  • If the Google support were ultimately a reduction in profits to shareholders, this would likely be a much smaller opportunity cost because of diminishing marginal returns, and therefore it would be reasonable for the model not to treat the Google ad costs as a “real” cost

Appendix 4e: Counterfactuals

For the impact scores we have gathered thus far, we have assumed that the counterfactual is zero impact.

This section explores this assumption, and concludes that, with regard to the current mechanism for reaching users (via Google ads), the assumption might be generous. However, the bot’s ultimate impact is not limited to the current mechanism for reaching users, and given that mental health support is, in general, undersupplied, it’s reasonable to believe that other contexts with better counterfactuals may arise.

Here’s how it works at the moment:

  • The user journey starts with someone googling “I’m feeling depressed” or similar

  • Our Google ads appear, and some people will click on those ads

  • If those ads had not been there, what would have happened instead?

To explore this, I have tried googling “I’m feeling depressed”, and the search results include advice from the NHS, websites providing advice on depression, and similar. (Note that one complication is that search results are not totally uniform from one user to the next.)

The content of those websites seems fine; however, I’ve heard comments from people saying things like:

  • “I don’t have the capacity to read through a long website like that—not having the emotional resources to do that is exactly what it means to be depressed!”

  • “The advice on those websites is all obvious. I *know* I’m supposed to eat healthy, get exercise, do mindfulness, get enough sleep etc. It’s not *knowing* what to do, it’s actually *doing* it that’s hard!”

This suggests that those websites aren’t succeeding in actually making people happier. Hence the zero-impact assumption.

However, these impressions are *not* based on careful studies. Specifically, I’m gathering these impressions from people who have ended up talking to me in some role in which I’m providing volunteer support to people who are feeling low. In such a context, there’s a risk of selection effects: maybe the people who found such websites useful got help and therefore didn’t end up needing my support.

Some important observations about counterfactuals in the specific context of sourcing users from Google ads:

  • Assessing counterfactuals is hard (in general, and in this case in particular)

  • As the resources in the Google search results seem credible, it seems unlikely that the counterfactual outcome would actually be *harmful*

  • So the assumption of zero impact used in the model errs on the side of being favourable to the chatbot’s cost-effectiveness

  • This is all the more worrying, as the effect size of the bot is unclear, but (at least based on the Cohen’s d measure) is currently often small (or at best medium). (Although it might be larger in the future, and looks larger based on other ways of assessing this; see the separate appendix about this.)

  • Note that the comparator (StrongMinds) clearly does have favourable counterfactuals (poor women in sub-Saharan Africa almost certainly *wouldn’t* get access to decent support without StrongMinds)

While the counterfactuals appear to raise some material doubts within the Google ads context, the ultimate impact of the project need not be tied solely to that context:

  • Just because the project currently sources users from Google ads, it doesn’t mean that it will only ever source users from Google ads.

  • Specifically, future expansion to the developing world may have counterfactuals which are more favourable to this project (assuming that the AI techniques can adapt across languages).

  • Ultimately, however, the overall/global undersupply of mental health services is relevant. Even if we don’t know now exactly where the project will reach users with favourable counterfactuals, the fact that mental health services are undersupplied increases the probability that such places can be found.

  • However, this is speculative.

Appendix 4f: About PHQ-9

This is what a standard PHQ-9 questionnaire looks like:

There are 9 questions, and a score out of 27 is generated by assigning a score of 0 for each tick in the first column, 1 for each tick in the second column, 2 for each tick in the third column and 3 for each tick in the fourth column, and then adding the scores up.

It is a standard tool for monitoring the severity of depression and the response to treatment.
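As a small illustration of that scoring, where each of the nine answers is recorded as its column index (0-3):

```python
# Each answer is the column ticked: 0 = first column ... 3 = fourth column.
# These nine example answers are made up purely to show the arithmetic.
answers = [1, 2, 0, 3, 1, 2, 1, 0, 2]

assert len(answers) == 9 and all(0 <= a <= 3 for a in answers)
phq9_total = sum(answers)        # score out of 27
print(phq9_total)                # -> 12
```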

Appendix 5: Other mental health apps

We are very much not the first people to think of providing mental health help via an app.

A list of about 15 other apps can be found here.

Therapeutic approaches which are used a lot:

  • CBT

  • Meditation

Approaches which don’t seem to come up often:

  • Psychodynamic

  • Rogerian

  • Existential

We appear to be taking an approach which is different from the existing mental health apps.

And yet if we look at attempts to pass the Turing test, an early attempt was a chatbot called Eliza, which was inspired by the Rogerian approach to therapy (the approach which is also closest to the Samaritans listening approach).

So it seemed surprising that people trying to pass the Turing test had employed a Rogerian approach, but people trying to tackle mental health had not.

To our knowledge, we are the first project taking a Rogerian-inspired conversational app and applying it for mental health purposes.

On a related note, this project seems unusual in including a relatively free-flowing conversational interface. While several other apps have a conversational or chatbot-like interface, these bots are normally constructed in a very structured way, meaning that most of the conversation has been predetermined; i.e. the bot sets the agenda, not the user. In several apps, there are actually no free-text fields at all.

We speculate that the reason for this is that the more open conversational paradigm was too intimidating for other app developers, who perhaps felt that solving mental health *and* the Turing test at the same time was too ambitious. Our approach is distinctive perhaps because we were inspired by the Samaritans approach, which is relatively close to an MVP of a therapeutic approach.

The fact that our interface is so free-flowing is important. It means that our bot’s approach is the closest to actual real-life therapy.