Some papers you might like if you haven’t seen them yet:
This paper uses your two methods, as well as mapping verbal descriptions to probabilities
Anthropic investigation of language model calibration. Interesting techniques include asking the model whether its answer was correct, using temperature scaling to restore calibration after RLHF, and training models to be better calibrated.
Foundational overview of calibration in ML models; advocates temperature scaling
Paper showing calibration often suffers under dataset distribution shift
Forecasting benchmark for language models
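Since temperature scaling comes up in two of the papers above, here is a minimal sketch of the idea, assuming you have held-out logits and labels (the `fit_temperature` helper and the toy data are illustrative, not from any of the papers): divide the logits by a scalar T, chosen to minimize negative log-likelihood on validation data, which leaves accuracy unchanged but can restore calibration.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(logits, labels):
    """Grid-search a scalar T > 0 minimizing NLL of softmax(logits / T).

    In practice T is usually fit by gradient descent on a validation set;
    a grid search keeps this sketch dependency-free.
    """
    def nll(t):
        p = softmax(logits / t)
        return -np.log(p[np.arange(len(labels)), labels] + 1e-12).mean()
    candidates = np.linspace(0.25, 8.0, 160)
    return min(candidates, key=nll)

# Toy example: logits inflated by 3x are overconfident, so the fitted
# temperature should come out well above 1 (roughly recovering the factor 3).
rng = np.random.default_rng(0)
true_logits = rng.normal(size=(500, 4))
labels = np.array([rng.choice(4, p=softmax(l[None])[0]) for l in true_logits])
T = fit_temperature(true_logits * 3.0, labels)
print(T)
```

Because dividing all logits by the same positive T preserves their argmax, this post-hoc fix changes confidence scores without changing any predictions.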
Nice, thanks