Keywords to search for other sources would be “multiplicative weight updates”, “follow the leader”, “follow the regularized leader”.
Note that this is for what’s sometimes called the “experts” setting, where you get full feedback on the counterfactual actions you didn’t take. But the same approach basically works with some slight modification for the “bandit” setting, where you only get to see the result of what you actually did.
Yes, and is there a proof of this that someone has put together? Or at least a more formal justification?
Here’s one set of lecture notes (don’t endorse that they’re necessarily the best, just first I found quickly) https://lucatrevisan.github.io/40391/lecture12.pdf
Keywords to search for other sources would be “multiplicative weight updates”, “follow the leader”, “follow the regularized leader”.
Note that this is for what’s sometimes called the “experts” setting, where you get full feedback on the counterfactual actions you didn’t take. But the same approach basically works with some slight modification for the “bandit” setting, where you only get to see the result of what you actually did.