I can't think of anything better than a t-test, but I'm open to suggestions.
If a forecaster is consistently off by something like 10 percentage points, I think that is a difference that matters. But even in that extreme scenario, where the (simulated) difference between two forecasters is in fact quite large, we have a hard time picking it up using standard significance tests.
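To make the point concrete, here is a minimal sketch of this kind of simulation. It is not the original setup. It assumes forecaster A reports the true probability, forecaster B is always 10 percentage points too high, both are scored with the Brier score, and the per-question score differences go into a paired t-test. The function names, the range of true probabilities, and the normal approximation to the critical value are all assumptions made for illustration.

```python
import math
import random
import statistics


def paired_t_statistic(diffs):
    """t statistic for H0: the mean paired difference is zero."""
    n = len(diffs)
    sd = statistics.stdev(diffs)
    return statistics.mean(diffs) / (sd / math.sqrt(n))


def rejection_rate(n_questions, bias=0.10, n_sims=300, seed=0):
    """Share of simulated tournaments in which the paired t-test
    detects the difference at the two-sided 5% level."""
    rng = random.Random(seed)
    # Assumption: normal approximation to the t critical value (about 1.96).
    crit = statistics.NormalDist().inv_cdf(0.975)
    rejections = 0
    for _ in range(n_sims):
        diffs = []
        for _ in range(n_questions):
            # Assumption: true probabilities uniform on [0.1, 0.8],
            # so the biased forecast p + bias stays below 1.
            p = rng.uniform(0.1, 0.8)
            y = 1 if rng.random() < p else 0
            brier_a = (p - y) ** 2          # forecaster A: perfectly calibrated
            brier_b = (p + bias - y) ** 2   # forecaster B: consistently 10 points high
            diffs.append(brier_b - brier_a)
        if abs(paired_t_statistic(diffs)) > crit:
            rejections += 1
    return rejections / n_sims


if __name__ == "__main__":
    for n in (50, 200, 1000):
        print(n, rejection_rate(n))
```

Under these assumptions the expected Brier difference per question is only 0.10² = 0.01, while the per-question noise is roughly nine times that, so the test needs many questions before it rejects reliably.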
Good comment, thank you!