Thanks for making this public, found it really interesting to follow your train of thought. Also, despite hearing about it in the past, I had completely forgotten about Julia’s book. Added it to my reading list now. :)
How much time should a participant roughly allocate for this? How much time are we supposed to spend on each of the questions? For how many days/weeks/months will this be running?
Is “start by finding someone to practice with” something one should do before signing up, i.e. should people sign up in groups of 2? Or does that matching of participants happen once you’ve got enough together? If the latter, do you have control over which of the two roles you get?
I couldn’t yet make that much sense of the descriptions of what backcaster and retriever are doing exactly, specifically the “pick a date” part and how the date influences things.
What degree of forecasting experience are you looking for? Or all types of people? Would it make sense for people to sign up when they’ve gone through a lot of calibration training in the past?
And a side note, the first paragraph on the linked page seems to have been pasted twice.
+1 to that! Really cool, thanks for doing this. :)
Thanks a lot for the thorough post Emily! I like the framing of staying up late as a high-interest loan a lot. And I agree that reading Why We Sleep may indeed be quite useful for certain people, despite its shortcomings. You make a lot of good points and provide several interesting ideas, plus the post is written in a very readable way, and the drawings are great.
Not that much else to add, except two tiny nitpicks regarding your estimation:
you equated “being 30% less productive” with “taking 30% more time to complete things”, but actually being 30% less productive means you take 100/70 − 1 ≈ 43% longer. (A more obvious example: being 50% less productive means you require twice the time, i.e. 100% more, not 50% more.)
concluding your estimation, your multiplication characters were interpreted as formatting, making the “0.254.9 + 0.502.1 + 0.10*0.4” part quite confusing to read. You could use × or • instead.
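That first nitpick can be sketched in a few lines; the function name is mine, purely illustrative:

```python
def extra_time_fraction(productivity_loss: float) -> float:
    """Fraction of additional time needed when productivity drops by `productivity_loss`.

    Output per unit time scales by (1 - loss), so the time needed to
    finish the same amount of work scales by 1 / (1 - loss).
    """
    return 1 / (1 - productivity_loss) - 1

print(round(extra_time_fraction(0.30), 3))  # 0.429 → ~43% longer, not 30%
print(round(extra_time_fraction(0.50), 3))  # 1.0 → 100% more, i.e. twice the time
```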
Same here. :)
Neat! Small mistake: “What is the probability that it will still be working after eight twenty years” should probably be “after twenty years”.
And multiple data points are exciting indeed!
Perfect! In the end the impact will of course be orders of magnitude higher, as a slightly better name of any particular organization will affect tens if not hundreds of thousands of people in the long run. And there may even be a tail chance of better names increasing the community’s stability and thus preventing collapse scenarios.
I think overall you really undersold your project with that guesstimate model focusing only on this post, as if that was all there is to it.
I believe there are a few serious flaws in your guesstimate model:
a year has 365.2421905 days, not 365.25. That’s not even rounded correctly!
Smiles per QALY should multiply days in a year with smiles in a good day, instead they are added. They don’t even have the same unit, how can you add them! Insanity!
the post’s karma is far outside of even your 99% interval
Everything else seems quite correct and I agree with your CIs and conclusions.
Also, please find a new name for guesstimate.
Nice post! Found it through the forum digest newsletter. Interestingly I knew Lindy’s Law as the “Copernican principle” from Algorithms to Live By, IIRC. Searching for the term yields quite different results however, so I wonder what the connection is.
Also, I believe your webcomic example is missing a “1 −”. You seem to have calculated p(no further webcomic will be released this year) rather than p(there will be another webcomic this year). Increasing the time frame should increase the probability, but with the formula as given in the example, the probability would in fact decrease over time.
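The example’s exact formula isn’t reproduced here, but the monotonicity point holds for any survival-style model; a toy Poisson release process (the rate is a made-up stand-in, not from the post) shows it:

```python
import math

RATE = 1.0  # assumed average releases per year; purely illustrative

def p_no_release(years: float, rate: float = RATE) -> float:
    """P(no further webcomic within `years`) under a Poisson release model."""
    return math.exp(-rate * years)

def p_some_release(years: float, rate: float = RATE) -> float:
    """P(at least one further webcomic within `years`) — note the `1 -`."""
    return 1 - p_no_release(years, rate)

# Without the `1 -`, the number shrinks as the time window grows;
# with it, the number grows, as one would expect.
print(p_no_release(0.5) > p_no_release(1.0))      # True: decreasing in time
print(p_some_release(0.5) < p_some_release(1.0))  # True: increasing in time
```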
“At 80% of the EA Münster local group’s meetups in 2021, more than 5 people were present” – how will cancelled meetups (due to lack of attendees, if that ever happens) count toward this? Not at all, or as <=5 attendees? (Kind of reminds me of how Deutsche Bahn decided not to count cancelled trains as delayed.)
Also, coming from EA Bonn where our average attendance is ~4 people, I find the implications of this question impressive. :D
I see, so at the end of the day you’re assigning a number representing how productive the day was, and you consider predicting that number the day before? If that rating is based on your feeling about the day rather than on objectively predefined criteria, the “predictions affect outcomes” issue might indeed be a bit larger here than described in the post: the prediction could then affect not only your behavior but also the rating itself, which could decouple the metric from reality to a degree.
If you end up doing this, I’d be very interested in how things go. May I message you in a month or so?
Good point, I also make predictions about quarterly goals (which I update twice a month) as well as my plans for the year. I find the latter especially difficult, as quite a lot can change within a year including my perspective on and priority of the goals. For short term goals you basically only need to predict to what degree you will act in accordance with your preferences, whereas for longer term goals you also need to take potential changes of your preferences into account.
It does appear to me that calibration can differ between the different time frames. I seem to be well calibrated regarding weekly plans, decently calibrated on the quarter level, and probably less so on the year level (I don’t yet have any data for the latter). Admittedly that weakens the “calibration can be achieved quickly in this domain” claim to a degree, as calibrating on “behavior over the next year” might still take a year or two to significantly improve.
I personally tend to stick to the following system:
Every Monday morning I plan my week, usually collecting anything between 20 and 50 tasks I’d like to get done that week (this planning step usually takes me ~20 minutes)
Most such tasks are clear enough that I don’t need to specify any further definition of done; examples would be “publish a post in the EA forum”, “work 3 hours on project X”, “water the plants” or “attend my local group’s EA social” – very little “wiggle room” or risk of not knowing whether any of these evaluates to true or false in the end
In a few cases, I do need to specify in greater detail what it means for the task to be done; e.g. “tidy up bedroom” isn’t very concrete, and I thus either timebox it or add a less ambiguous evaluation criterion
Then I go through my predictions from the week before and evaluate them based on which items are crossed off my weekly to do list (~3 minutes)
“Evaluate” at first only means writing a 1 or a 0 in my spreadsheet next to the predicted probability
There are rare exceptions where I drop individual predictions entirely due to inability to evaluate them properly, e.g. because the criterion seemed clear during planning, but it later turned out I had failed to take some aspect or event into consideration, or because I deliberately decided to not do the task for unforeseeable reasons. Of course I could invest more time into bulletproofing my predictions to prevent such cases altogether, but my impression is that it wouldn’t be worth the effort.
After that I check my performance of that week as well as of the most recent 250 predictions (~2 minutes)
For the week itself, I usually only compare the expected value (sum of probabilities) with actually resolved tasks, to check for general over- or underconfidence, as there aren’t enough predictions to evaluate individual percentage ranges
For the most recent 250 predictions I check my calibration by sorting the predictions into probability ranges of 0..9%, 10..19%, …, 90..99%, and checking how much the average outcome ratio of each range deviates from the average prediction in that range. This is just a quick visual check, which lets me know in which percentage range I tend to be far off.
I try to use both these results in order to adjust my predictions for the upcoming week in the next step
Finally I assign probabilities to all the tasks. I keep this list of predictions hidden from myself throughout the following week in order to minimize the undesired effect of my predictions affecting my behavior (~5 minutes)
These predictions are very much System 1 based and any single prediction usually takes no more than a few seconds.
I can’t remember how difficult this was when I started this system ~1.5 years ago, but by now coming up with probabilities feels highly natural and I differentiate between things being e.g. 81% likely or 83% likely without the distinction feeling arbitrary.
Depending on how striking the results from the evaluation steps were, I slightly adjust the intuitively generated numbers. This also happens intuitively as opposed to following some formal mathematical process.
While this may sound complex when explaining it, I added the time estimates to the list above in order to demonstrate that all of these steps are pretty quick and easy. Spending these 10 minutes each week seems like a fair price for the benefits it brings.
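The evaluation steps above (expected-value check plus bucketed calibration check) can be sketched as follows; the data and function name are illustrative, not my actual spreadsheet:

```python
from collections import defaultdict

def calibration_report(predictions):
    """Bucket (probability, outcome) pairs into 10%-wide ranges and compare
    each bucket's mean predicted probability with its resolution rate.

    `predictions`: list of (p, outcome) with 0 <= p < 1 and outcome 0 or 1.
    Returns {bucket_index: (mean_prediction, resolution_rate, count)}.
    """
    buckets = defaultdict(list)
    for p, outcome in predictions:
        buckets[int(p * 10)].append((p, outcome))
    return {
        b: (sum(p for p, _ in pairs) / len(pairs),
            sum(o for _, o in pairs) / len(pairs),
            len(pairs))
        for b, pairs in sorted(buckets.items())
    }

# Illustrative data, not real entries.
preds = [(0.90, 1), (0.85, 1), (0.90, 0), (0.30, 0), (0.20, 1), (0.95, 1)]

# Quick over-/underconfidence check: expected vs. actual number resolved true.
print(f"expected {sum(p for p, _ in preds):.1f} true, got {sum(o for _, o in preds)}")

# Per-bucket calibration, eyeballed rather than formally scored.
for b, (mean_p, rate, n) in calibration_report(preds).items():
    print(f"{b*10:2d}..{b*10+9}%: predicted {mean_p:.0%}, resolved {rate:.0%} (n={n})")
```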
An example: I had “make a check-up appointment with my dentist” on my list, but when calling during the week I realized the dentist was on vacation and no appointment could be made. Given there was no time pressure and I preferred making an appointment there later over calling a different dentist, the task itself was not achieved, yet my behavior was as desired. As there are arguments for evaluating this both as true and as false, I often just drop such cases from my evaluation entirely ↩︎
I once had the task “sign up for library membership” on my list, but then during the week realized that membership was more expensive than I had thought, and decided to drop that goal. Here too, you could argue either “the goal is concluded” (no todo remains open at the end of the week) or “I failed the task” (I didn’t do the formulated action), so I usually ignore such cases instead of evaluating them arbitrarily ↩︎
One could argue that a 5% and a 95% prediction should really end up in the same bucket, as they entail the same level of certainty; my experience with this particular forecasting domain however is that the symmetry implied by this argument is not necessarily given here. The category of things you’re very likely to do seems highly different in nature from the category of things you’re very unlikely to do. This lack of symmetry can also be observed in the fact that 90% predictions are ~10x more frequent for me in this domain than 10% predictions. ↩︎
It’s 30 minutes total, but the first 20 are just the planning process itself, whereas the 3+2+5 afterwards are the actual forecasting & calibration training. ↩︎
“Before January 1st” in any particular time zone? I’ll probably (85%) publish something within the next ~32h at the time of writing this comment. In case you’re based in e.g. Australia or Asia that might then be January 1st already. Hope that still qualifies. :)
Indeed, thank you. :) I haven’t started the other, forecasting related one, but intend to spend some time on it next week and hopefully come up with something publishable before the end of the year.
My thoughts on how to best prepare for the workshop (as mentioned in the post):
Write down your expectations, i.e. what you personally hope to take away from the workshop (and if you’re fancy, maybe even add quantifications/probability estimates to each point)
Make sure you can go into the workshop with a clear head and without any distractions
Don’t make the same mistake I made, which was booking a flight home way too early on the day after the end of the workshop. I didn’t realize beforehand how difficult it was to get from the workshop venue to the airport, and figuring out a solution stressed me quite a bit during the week (but was in the end solved for me by the super kind ops people)
Do your best in the week(s) before to stay healthy
Sleep enough the nights before
Maybe prepare a bug list and take it with you; this will also be one of the first sessions, but the more the better
Don’t panic; if you don’t manage to prepare in any significant way, the workshop is still extremely well designed and you’ll do just fine.
Sure. Those I can mention without providing too much context:
calibrating on one’s future behavior by making a large amount of systematic predictions on a weekly basis
utilizing quantitative predictions in the process of setting goals and making plans
not prediction-related, but another thing your post triggered: applying the “game jam principle” (developing a complete video game in a very short amount of time, such as 48 hours) to EA forum posts, i.e. trying to get from idea to published post within a single day. I realized that writing a forum post is (for me, and a few others I’ve spoken to) often a multi-week-to-month endeavour, and it doesn’t have to be that way; plus there are surely diminishing returns to the amount of polishing you put into it
If anybody actually ends up planning to write a post on any of these, feel free to let me know so I’ll make sure to focus on something else.
Good timing and great idea. Considering I’ve just read this: https://forum.effectivealtruism.org/posts/8Nwy3tX2WnDDSTRoi/announcing-the-forecasting-innovation-prize, I’ll gladly commit to submitting at least one forum post to the forecasting innovation prize (precise topic remains to be determined). Which entails writing and publishing a post here or on lesswrong before the end of the year.
I further commit to publishing a second post (which I’d already been writing on for a while) before the end of the year.
If anybody would like to hold me accountable, feel free to contact me around December 20th and be very disappointed if I haven’t published a single post by then.
Thanks for the prompt Neel!