Why do you choose an arithmetic mean for aggregating these estimates?
This is a good point.
I’d add that as a general rule when aggregating binary predictions one should default to the average log odds, perhaps with an extremization factor as described in (Satopää et al, 2014).
This is a good point.
I’d add that as a general rule when aggregating binary predictions one should default to the average log odds, perhaps with an extremization factor as described in (Satopää et al, 2014).
The reasons are a) empirically, it seems to work better, b) the way Bayes rules works it seems to suggest very strongly than log odds are the natural unit of evidence, c) apparently there are some complex theoretical reasons (“external bayesianism”) why this is better (the details go a bit over my head).