I personally tend to stick to the following system:
Every Monday morning I plan my week, usually collecting anything between 20 and 50 tasks I’d like to get done that week (this planning step usually takes me ~20 minutes)
Most such tasks are clear enough that I don’t need to specify any further definition of done; examples would be “publish a post in the EA forum”, “work 3 hours on project X”, “water the plants” or “attend my local group’s EA social” – very little “wiggle room” or risk of not knowing whether any of these evaluates to true or false in the end
In a few cases, I do need to specify in greater detail what it means for the task to be done; e.g. “tidy up bedroom” isn’t very concrete, and I thus either timebox it or add a less ambiguous evaluation criterion
Then I go through my predictions from the week before and evaluate them based on which items are crossed off my weekly to-do list (~3 minutes)
“Evaluate” at first only means writing a 1 or a 0 in my spreadsheet next to the predicted probability
There are rare exceptions where I drop individual predictions entirely due to inability to evaluate them properly, e.g. because the criterion seemed clear during planning but it later turned out I had failed to take some aspect or event into consideration[1], or because I deliberately decided not to do the task for unforeseeable reasons[2]. Of course I could invest more time into bulletproofing my predictions to prevent such cases altogether, but my impression is that it wouldn’t be worth the effort.
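To make the bookkeeping concrete, here is a minimal sketch of what one week’s records amount to (my actual setup is just a plain spreadsheet with one row per task; the Python below, with its made-up tasks and numbers, is purely illustrative):

```python
# One week's records, sketched as (task, predicted probability, outcome)
# tuples. Outcome is 1 (done), 0 (not done), or None for the rare
# predictions that get dropped entirely. Tasks and numbers are made up.
week = [
    ("publish a post in the EA forum",         0.85, 1),
    ("work 3 hours on project X",              0.70, 0),
    ("water the plants",                       0.95, 1),
    ("make check-up appointment with dentist", 0.90, None),  # dropped, see [1]
]

# Only predictions with a 0/1 outcome enter the evaluation.
evaluated = [(p, outcome) for _, p, outcome in week if outcome is not None]
```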
After that I check my performance for that week as well as over the most recent 250 predictions (~2 minutes)
For the week itself, I usually only compare the expected value (the sum of the predicted probabilities) with the number of tasks that actually resolved true, to check for general over- or underconfidence, as there aren’t enough predictions to evaluate individual percentage ranges
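In code terms that weekly check is a single comparison; a sketch, assuming predictions are stored as (probability, outcome) pairs with None for dropped ones:

```python
def weekly_over_under(predictions):
    """Compare the expected number of completed tasks (the sum of the
    predicted probabilities) with the number that actually resolved true."""
    resolved = [(p, o) for p, o in predictions if o is not None]
    expected = sum(p for p, _ in resolved)
    actual = sum(o for _, o in resolved)
    return expected, actual  # expected clearly above actual hints at overconfidence

# e.g. expected = 2.2 vs. actual = 2 for this toy week
print(weekly_over_under([(0.9, 1), (0.8, 1), (0.5, 0), (0.3, None)]))
```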
For the most recent 250 predictions I check my calibration by sorting the predictions into probability ranges of 0..9%, 10..19%, … 90..99%[3] and checking how much the average outcome ratio of each range deviates from the average prediction in that range. This is just a quick visual check, which lets me know in which percentage range I tend to be far off.
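The calibration check can be sketched the same way (again just an illustration of the logic; in practice this is a handful of spreadsheet formulas, see the exchange further down):

```python
def calibration_report(predictions):
    """Sort predictions into 10-point buckets (0..9%, 10..19%, ..., 90..99%)
    and compare each bucket's average prediction with its observed hit rate."""
    buckets = {}
    for p, o in predictions:
        if o is None:                      # dropped predictions are ignored
            continue
        buckets.setdefault(min(int(p * 10), 9), []).append((p, o))
    for b in sorted(buckets):
        items = buckets[b]
        avg_p = sum(p for p, _ in items) / len(items)
        hit_rate = sum(o for _, o in items) / len(items)
        print(f"{b * 10:2d}..{b * 10 + 9}%:  n={len(items):3d}  "
              f"avg prediction={avg_p:.2f}  hit rate={hit_rate:.2f}")
```

A bucket whose hit rate sits noticeably below its average prediction points to overconfidence in that range, and vice versa; that deviation is the signal I carry into the next step.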
I try to use both these results in order to adjust my predictions for the upcoming week in the next step
Finally I assign probabilities to all the tasks. I keep this list of predictions hidden from myself throughout the following week in order to minimize the undesired effect of my predictions affecting my behavior (~5 minutes)
These predictions are very much System 1 based and any single prediction usually takes no more than a few seconds.
I can’t remember how difficult this was when I started this system ~1.5 years ago, but by now coming up with probabilities feels highly natural and I differentiate between things being e.g. 81% likely or 83% likely without the distinction feeling arbitrary.
Depending on how striking the results from the evaluation steps were, I slightly adjust the intuitively generated numbers. This also happens intuitively as opposed to following some formal mathematical process.
While this may sound complex when explained, I added the time estimates to the list above to demonstrate that all of these steps are pretty quick and easy. Spending these 10 minutes[4] each week seems like a fair price for the benefits it brings.
An example would be “make a check-up appointment with my dentist”: when calling during the week, I realized the dentist was on vacation and no appointment could be made; given there was no time pressure and I preferred making an appointment there later over calling a different dentist, the task itself was not achieved, yet my behavior was as desired. As there are arguments for evaluating this as either true or false, I often just drop such cases from my evaluation entirely.
I once had the task “sign up for library membership” on my list, but then during the week realized that membership was more expensive than I had thought, and thus decided to drop that goal; here too, you could either argue “the goal is concluded” (no to-do remains open at the end of the week) or “I failed the task” (as I didn’t do the formulated action), so I usually ignore those cases instead of evaluating them arbitrarily.
One could argue that a 5% and a 95% prediction should really end up in the same bucket, as they entail the same level of certainty; my experience in this particular forecasting domain, however, is that the symmetry implied by this argument does not necessarily hold here. The category of things you’re very likely to do seems highly different in nature from the category of things you’re very unlikely to do. This lack of symmetry can also be observed in the fact that 90% predictions are ~10x more frequent for me in this domain than 10% predictions.
It’s 30 minutes total, but the first 20 are just the planning process itself, whereas the 3+2+5 afterwards are the actual forecasting & calibration training.
Do you have a way to automatically calculate averages for each probability bucket? I’m trying to start a system like this, but I don’t see a way to score the percentage I get right in each probability category other than manually selecting which predictions go into which bucket (right now each of my predictions is a single row in an Excel sheet).
Sort of. Firstly, I have a field next to each prediction that automatically computes its “bucket number” (which is just FLOOR(<prediction> * 10)). To then get the average predicted probability of a certain bucket, I run the following: =AVERAGE(INDEX(FILTER(C$19:K, K$19:K=A14), , 1)) – note that this is Google Sheets though, and I’m not sure to what degree it transfers to Excel. For context: column C contains my predicted probabilities, column K contains the computed bucket numbers, and A14 here is the bucket for which I’m computing this. Similarly, I count the number of predictions in a given bucket with =ROWS(FILTER(K$19:K, K$19:K<>"", K$19:K=A14)) and the ratio of predictions in that bucket that ended up true with =COUNTIF(FILTER(D$19:K, K$19:K=A14), "=1") / D14 (D19 onwards contains 1 and 0 values depending on whether the prediction happened or not; D14 is the aforementioned number of predictions in that bucket).
If this doesn’t help, let me know and I can clean up one such spreadsheet, see if I can export it as an xlsx file, and send it to you.
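If you ever want to step outside the spreadsheet entirely, the same per-bucket aggregation is only a few lines of pandas; note that the file name and column names below are placeholders for however your Excel sheet is actually laid out:

```python
import pandas as pd

# Placeholder file and column names; rename to match your sheet.
df = pd.read_excel("predictions.xlsx")                   # one row per prediction
df = df.dropna(subset=["probability", "outcome"])        # skip dropped/unresolved rows
df["bucket"] = (df["probability"] * 10).astype(int).clip(upper=9)

summary = df.groupby("bucket").agg(
    n=("probability", "size"),
    avg_prediction=("probability", "mean"),
    hit_rate=("outcome", "mean"),
)
print(summary)
```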
Do you only make forecasts that resolve within the week? I imagine it would also be useful to sharpen one’s predictive skills for longer timeframes, e.g. achieving milestones of a project, finishing a chapter of your thesis, etc.
Good point; I also make predictions about quarterly goals (which I update twice a month) as well as my plans for the year. I find the latter especially difficult, as quite a lot can change within a year, including my perspective on and prioritization of the goals. For short-term goals you basically only need to predict to what degree you will act in accordance with your preferences, whereas for longer-term goals you also need to take potential changes of your preferences into account.
It does appear to me that calibration can differ between the different time frames. I seem to be well calibrated regarding weekly plans, decently calibrated at the quarter level, and probably less so at the year level (I don’t yet have any data for the latter). Admittedly that weakens the “calibration can be achieved quickly in this domain” claim to a degree, as calibrating on “behavior over the next year” might still take a year or two to improve significantly.