[based on an internally run study of 250 uses] Mind Ease reduces anxiety by 51% on average, and helps people feel better 80% of the time.
Extraordinary claims like this (and it’s not the only one; e.g. “very likely” to help myself or people I know who suffer from anxiety elsewhere in the post, “And for anxiety [discovering which interventions work best] is what we’ve done”, and “45% reduction in negative feelings” in the app itself) demand much fuller and more rigorous description and justification, e.g. (and cf. PICO):
(Population): How are you recruiting the users? Mturk? Positly? Convenience sample from sharing the link? Are they paid for participation? Are they ‘people validated (somehow) as having an anxiety disorder’ or (as I guess) ‘people interested in reducing their anxiety/having something to help when they are particularly anxious?’
(Population): Are the “250 uses” 250 individuals each using Mindease once? If not, what’s the distribution of duplicates?
(Intervention): Does “250 uses” include everyone who fired up the app, or only those who ‘finished’ the exercise (and presumably filled out the post-exposure assessment)?
(Comparator): Is this a pre-post result? Or is this vs. the sham control mentioned later? (If so, what is the effect size on the sham control?)
(Outcome): If pre-post, is the postexp assessment immediately subsequent to the intervention?
(Outcome): “reduces anxiety by 51%” on what metric? (Playing with the app suggests 5-level Likert scales?)
(Outcome): Ditto ‘feels better’ (measured how?)
(Outcome): Effect size (51% from what to what?) Inferential stats on the same (SE/CI, etc.)
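To make the last ask concrete, here is a sketch of what such inferential statistics could look like, run on entirely invented pre/post scores (the real metric, data, and sample are not public; every number below is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-use scores: sum of three 5-point Likert items on negative
# feelings (range 0-12), measured immediately before and after an intervention.
# These numbers are invented purely for illustration.
pre = rng.integers(1, 13, size=250).astype(float)
post = np.clip(pre - np.round(rng.normal(3.5, 2.5, size=250)), 0, 12)

reduction = (pre - post) / pre        # fractional reduction per use
mean_red = reduction.mean()

# Bootstrap 95% CI on the mean fractional reduction (the SE/CI asked for)
boot = np.array([rng.choice(reduction, size=reduction.size, replace=True).mean()
                 for _ in range(5000)])
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"mean reduction {mean_red:.0%}, 95% CI [{lo:.0%}, {hi:.0%}]")
```

Reporting something of this shape (metric range, absolute pre and post means, and an interval estimate) would let readers judge the “51%” for themselves.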
There are also natural external validity worries. If (as I think it is) the objective is ‘immediate symptomatic relief’, results are inevitably confounded: anxiety is a symptom that is often transient (or at least fluctuating in intensity), and one with high rates of placebo response. An app which does literally nothing but wait a couple of days before assessing (symptomatic) anxiety again would probably show great reductions in self-reported anxiety on pre-post comparison, as people will preferentially use the app when feeling particularly anxious, and severity will tend to regress to the mean. This effect could apply even over much shorter intervals (e.g. the time required to perform a recommended exercise).
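This selection-plus-regression effect is easy to demonstrate in simulation. The sketch below (all numbers hypothetical, with no relation to Mind Ease’s actual data) gives each person a fluctuating anxiety level, has them open a do-nothing app only when anxiety is high, and still recovers a large apparent pre-post “reduction”:

```python
import numpy as np

rng = np.random.default_rng(1)

# Each person's anxiety fluctuates day to day around a personal baseline;
# they open the (do-nothing) app only when anxiety crosses a threshold.
n_people, n_days, threshold = 500, 30, 7.0
baselines = rng.normal(5.0, 1.0, size=n_people)

pre_scores, post_scores = [], []
for b in baselines:
    anxiety = b + rng.normal(0.0, 2.0, size=n_days)   # daily fluctuation
    for day in range(n_days - 1):
        if anxiety[day] > threshold:                  # self-selected high-anxiety use
            pre_scores.append(anxiety[day])
            post_scores.append(anxiety[day + 1])      # re-measured later; no treatment

pre, post = np.array(pre_scores), np.array(post_scores)
apparent = (pre.mean() - post.mean()) / pre.mean()
print(f"apparent reduction with zero treatment effect: {apparent:.0%}")
```

With these (made-up) parameters the null app shows a sizeable apparent reduction, which is the size of the bias a pre-post design has to rule out.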
(Aside: an interesting validity test would be using GAD-7 for the pre-post assessment. As all the items on GAD-7 ask ‘how often have you experienced X over the last 2 weeks’, a significant reduction in this metric immediately after the intervention should raise alarm.)
In candour (and with regret), this write-up raises a lot of red flags for me. There is a large relevant literature which this post does not demonstrate command of. For example, there is a small hill of descriptive epidemiology papers on the prevalence of anxiety as a symptom or of anxiety disorders (including large population samples for GAD-7), which look like better routes to prevalence estimates than conducting a 300-person survey. (And if you do run such a survey, finding that 73% of your sample scores >5 on GAD-7, when population studies report means and medians of ~2–3 and proportions scoring >5 of ~25%, prompts obvious questions.)
Likewise, there are well-understood pitfalls in conducting research (some of them particularly acute for intervention studies, and even more so for intervention studies on mental health), which the ‘marketing copy’ style of presentation (heavy on exuberant confidence, light on how it is substantiated) gives little reassurance were in fact avoided. I appreciate that “writing for an interested lay audience” (i.e. this one) demands a different style than writing to cater to academic scepticism. Yet the latter should be satisfied too (either here or in a linked write-up), especially when attempting pioneering work in this area and claiming “extraordinarily good” results. We would be cautious in accepting this from outside sources; we should mete out similar measure to projects developed ‘in house’.
I hope subsequent work proves my worries unfounded.
Hey Gregory,

Thanks for the in-depth response.

As I’m sure you are aware, this post had the goal of making people in the EA community aware of what we are working on, and why we are working on it, rather than attempting to provide rigorous proof of the effectiveness of our interventions.
One important thing to note is that we’re not aiming to treat long-term anxiety, but rather the acute symptoms of anxiety, to help people feel better quickly at the moments they need it. We measure anxiety immediately before the intervention, run the intervention, then measure anxiety again (using three Likert-scale questions asked both immediately before and immediately after the intervention). At this point we have run studies testing many techniques for quickly reducing acute anxiety, so we know that some work much better than others.
I’ve updated the post with some edits and extra footnotes in response to your feedback. Here are some point-by-point responses:
How are you recruiting the users? Mturk? Positly?
We recruit paid participants for our studies via Positly.com (which pulls from Mechanical Turk, automatically applying extra quality measures and providing us with extra researcher-focused features). Depending on the goals of a study we sometimes recruit broadly (from anyone who wants to participate), and other times specifically seek to recruit people with high levels of anxiety.
Are the “250 uses” 250 individuals each using Mindease once? If not, what’s the distribution of duplicates?
This data comes from 49 paid study participants, each of whom used the app about 5 times on average over a period of about 5 days (at whatever times they chose).
This particular study targeted users who experience at least some anxiety.
Does “250 uses” include everyone who fired up the app, or only those who ‘finished’ the exercise (and presumably filled out the post-exposure assessment)?
It’s based only on the people who completed an intervention (i.e. where we had both a pre and a post measurement).
Is this a pre-post result? Or is this vs. the sham control mentioned later? (If so, what is the effect size on the sham control?)
This is a pre-post result. In one of our earlier studies we found the effectiveness of the interventions to be about 2x–2.5x that of the control (13–17 “points” of pre-post mood change versus about 7 for the control). We’ve since changed a lot about our methodology and interventions, though, and don’t yet have control measurements with the new changes.
If pre-post, is the postexp assessment immediately subsequent to the intervention?
Yes. Our goal is to have the user be much calmer by the time they finish the intervention than they were when they started.
“reduces anxiety by 51%” on what metric? (Playing with the app suggests 5-level Likert scales?)
We use the negative feelings (not any positive feelings) reported on the 3 Likert-scale questions. People who reported no negative feelings at the beginning of the intervention are excluded from the analysis, since there are no reported negative feelings that we could reduce.
Ditto ‘feels better’ (measured how?)
The 80% success rate refers to the proportion of uses in which a user’s negative feelings were reduced by any amount.
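For concreteness, the computation behind these two headline numbers can be sketched like this (the scores below are invented for illustration, not our actual data or analysis code):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical per-use negative-feeling scores (sum of 3 Likert items, 0-12),
# immediately before and after an intervention; invented for illustration.
pre = rng.integers(0, 13, size=250)
post = np.clip(pre - rng.integers(-2, 9, size=250), 0, 12)

# Exclude uses reporting no negative feelings at baseline: there is nothing
# for the intervention to reduce.
mask = pre > 0
pre_nz, post_nz = pre[mask], post[mask]

# "Reduces anxiety by X%": here read as the mean per-use fractional reduction
# (reduction in the mean score would be a different, also defensible, number).
mean_reduction = ((pre_nz - post_nz) / pre_nz).mean()

# "Helps people feel better Y% of the time": share of uses with any reduction.
success_rate = (post_nz < pre_nz).mean()

print(f"mean reduction: {mean_reduction:.0%}, success rate: {success_rate:.0%}")
```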
And thank you for telling me your honest reaction; your feedback has helped improve the post.