A comment and then a question. One problem I’ve encountered in trying to explain ideas like this to a non-technical audience is that the standard rationales for ‘why softmax’ are either a) technical or b) unconvincing, or even condescending about its value as a decision-making approach. Indeed, the ‘Agents as probabilistic programs’ page you linked to introduces softmax as “People do not always choose the normatively rational actions. The softmax agent provides a simple, analytically tractable model of sub-optimal choice.” The ‘Softmax demystified’ page offers relatively technical reasons (smoothing is good, flickering is bad) and an unsupported claim (that it is good to pick lower-utility options some of the time). Implicitly, this gives presentations of ideas like this the flavor of “trust us, you should use this because it works in practice, even if it has origins in what we think is irrational or can’t justify”. And, to be clear, I say that as someone who’s on your side, trying to think of how to share these ideas with others. I think there is probably a link between what I’ve described above and Michael Plant’s point (3).
So, I’m wondering if ‘we can do better’ in justifying softmax (and similar approaches). What is the most convincing argument you’ve seen?
I feel like the holy grail would be an empirical demonstration that an RL agent develops softmax-like properties across a range of realistic environments, and/or a theoretical argument for why this should happen.
One justification might be that in an online setting, where you have to learn which options are best from past observations, the naive “follow the leader” approach (always playing whichever action looks best so far) is easily exploited by an adversary.
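Here’s a toy demonstration of that exploitability (my own sketch, not from either page): with two actions, an adversary can always assign loss to whichever action “follow the leader” is about to pick, so FTL loses every round while either fixed action loses only about half the time.

```python
def ftl_choice(cum_loss):
    # Pick the action with the lowest cumulative loss so far (ties -> action 0).
    return min(range(len(cum_loss)), key=lambda a: cum_loss[a])

T = 100
cum_loss = [0.5, 0.0]  # small head start so the leader alternates every round
ftl_total = 0.0
for t in range(T):
    choice = ftl_choice(cum_loss)
    # Adversary assigns loss 1 to whichever action FTL just chose.
    losses = [0.0, 0.0]
    losses[choice] = 1.0
    ftl_total += losses[choice]
    for a in range(2):
        cum_loss[a] += losses[a]

best_fixed = min(cum_loss)  # loss of the best single action in hindsight
print(ftl_total, best_fixed)  # FTL: 100.0, best fixed action: 50.0
```

So FTL’s regret grows linearly in T, which is exactly what the regularization is there to prevent.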
This problem goes away if you make actions more likely when they’ve performed well, but regularize a little to smooth things out. The most common regularizer is entropy, and then, as described on the “Softmax demystified” page, you basically end up recovering softmax (this is the well-known “multiplicative weight updates” algorithm).
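To make the equivalence concrete (my own illustration): playing each action with probability proportional to exp(eta × cumulative reward) is a softmax, and updating those probabilities multiplicatively each round gives exactly the same distribution — that multiplicative form is the multiplicative weight updates algorithm.

```python
import math

def softmax_policy(cum_rewards, eta):
    # Softmax over cumulative rewards: p_a proportional to exp(eta * R_a).
    weights = [math.exp(eta * r) for r in cum_rewards]
    z = sum(weights)
    return [w / z for w in weights]

def mwu_step(probs, rewards, eta):
    # Multiplicative update: scale each weight by exp(eta * reward), renormalize.
    weights = [p * math.exp(eta * r) for p, r in zip(probs, rewards)]
    z = sum(weights)
    return [w / z for w in weights]

eta = 0.5
reward_rounds = [[1.0, 0.0, 0.2], [0.3, 0.9, 0.1], [0.0, 1.0, 0.5]]

# Iterate MWU from the uniform distribution...
probs = [1 / 3] * 3
for r in reward_rounds:
    probs = mwu_step(probs, r, eta)

# ...and compare against the softmax of the summed rewards.
cum = [sum(r[a] for r in reward_rounds) for a in range(3)]
direct = softmax_policy(cum, eta)
print(probs, direct)  # identical up to floating point
```

The two agree because the product of the per-round exp factors is just exp of the summed rewards.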
Yes, and is there a proof of this that someone has put together? Or at least a more formal justification?
Here’s one set of lecture notes (I don’t endorse them as necessarily the best, they’re just the first I found quickly): https://lucatrevisan.github.io/40391/lecture12.pdf
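For what it’s worth, the key step is short (this is my summary of the standard calculation, not a full regret proof): maximizing cumulative reward plus an entropy bonus over the probability simplex has a closed-form solution, and it is exactly softmax.

```latex
% Entropy-regularized "follow the regularized leader": with R_a the cumulative
% reward of action a, choose p on the simplex \Delta to solve
\max_{p \in \Delta}\; \eta \sum_a p_a R_a - \sum_a p_a \log p_a
% Stationarity of the Lagrangian (multiplier \lambda for \sum_a p_a = 1):
\eta R_a - \log p_a - 1 + \lambda = 0
% and solving for p_a, then normalizing, gives the softmax distribution:
p_a = \frac{e^{\eta R_a}}{\sum_b e^{\eta R_b}}
```

The regret bound on top of this (why the resulting algorithm competes with the best fixed action) is what the lecture notes work through.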
Keywords to search for other sources would be “multiplicative weight updates”, “follow the leader”, “follow the regularized leader”.
Note that this is for what’s sometimes called the “experts” setting, where you get full feedback on the counterfactual actions you didn’t take. But the same approach basically works with some slight modification for the “bandit” setting, where you only get to see the result of what you actually did.
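The standard modification for the bandit setting is the EXP3 algorithm: since you only observe the reward of the arm you pulled, you feed the softmax weights an importance-weighted estimate (observed reward divided by the probability of pulling that arm), which is unbiased for the full reward vector. A minimal sketch (parameter values are my own choices for illustration):

```python
import math
import random

def exp3_probs(cum_est, eta, gamma):
    m = max(eta * r for r in cum_est)          # subtract max to stabilize exp
    w = [math.exp(eta * r - m) for r in cum_est]
    z = sum(w)
    k = len(cum_est)
    # Mix in a little uniform exploration; this also bounds the estimates below.
    return [(1 - gamma) * wi / z + gamma / k for wi in w]

rng = random.Random(0)
true_rewards = [0.2, 0.8, 0.5]                 # arm 1 is best
cum_est = [0.0, 0.0, 0.0]
counts = [0, 0, 0]
for _ in range(2000):
    probs = exp3_probs(cum_est, eta=0.05, gamma=0.1)
    arm = rng.choices(range(3), weights=probs)[0]
    # Importance-weighted estimate: unbiased, since E[r_a * 1{pulled a} / p_a] = r_a.
    cum_est[arm] += true_rewards[arm] / probs[arm]
    counts[arm] += 1

print(counts)  # pulls concentrate on the best arm
```

The analysis in the experts setting then goes through with the estimated rewards in place of the true ones, at the cost of a somewhat worse regret bound.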