Softmax Selection in Reinforcement Learning


I’ve been reading Reinforcement Learning: An Introduction (2017) for a few days.
In Chapter 2, the authors removed the section on softmax selection that was present in the 2012 edition, and instead introduce the gradient bandit algorithm, which uses a numerical preference H(t) rather than Q(t) to compute action-selection probabilities.
I’ve tried implementing the multi-armed bandit problem with a direct softmax selection approach. Its performance is really not bad; at least, it’s better than UCB in my runs. Does anyone know why they removed this part from the book?
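For reference, roughly what I did: a minimal sketch of softmax (Boltzmann) selection over sample-average value estimates. The 10-armed testbed setup, temperature `tau=0.1`, and horizon are just my own choices, not anything prescribed:

```python
import numpy as np

def softmax_probs(q, tau):
    """Boltzmann distribution over value estimates q at temperature tau."""
    z = (q - q.max()) / tau          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
k = 10
true_means = rng.normal(0.0, 1.0, size=k)   # hypothetical 10-armed testbed
q = np.zeros(k)                             # sample-average value estimates
n = np.zeros(k)                             # pull counts

for t in range(1000):
    a = rng.choice(k, p=softmax_probs(q, tau=0.1))  # softmax action selection
    r = rng.normal(true_means[a], 1.0)              # noisy reward
    n[a] += 1
    q[a] += (r - q[a]) / n[a]                       # incremental mean update
```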


I think you would have to ask the authors to get the real reason.

There are actually many ways to do action selection in bandit problems; perhaps it was cut just to simplify things. Although softmax action selection can work OK, the temperature is an awkward hyperparameter: you have to scale it carefully, and the right setting depends on the variance of the rewards.
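To make the scaling point concrete: multiplying all value estimates by a constant is exactly equivalent to dividing the temperature by the same constant, so a temperature tuned for one reward scale behaves very differently when the rewards are rescaled. A small illustrative sketch (the values and temperatures are arbitrary):

```python
import numpy as np

def softmax_probs(q, tau):
    """Stable softmax over q at temperature tau."""
    z = (q - q.max()) / tau
    e = np.exp(z)
    return e / e.sum()

q = np.array([1.0, 0.5, 0.0])

low = softmax_probs(q, 0.05)    # small tau: near-greedy
high = softmax_probs(q, 10.0)   # large tau: near-uniform

# Rescaling the values by 10 is the same as shrinking tau by 10:
assert np.allclose(softmax_probs(10.0 * q, 1.0), softmax_probs(q, 0.1))
```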

One key difference between the gradient bandit and softmax (Gibbs) action selection over Q values is that the theory behind the former crops up again in the later chapter on policy gradients, whereas softmax action selection does not turn up anywhere else in the book (it is still valid in full RL; the caveats about the temperature hyperparameter still apply). So perhaps it was just a bit of a dead end, as it didn't lead to further discussion of action selection elsewhere. The more complex action selection options are not really discussed in detail in the book anyway.
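For comparison, here is a minimal sketch of the gradient bandit approach the book uses instead: preferences H(a), a softmax policy over them, and an average-reward baseline. The testbed, step size `alpha=0.1`, and horizon are my own assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
k = 10
true_means = rng.normal(0.0, 1.0, size=k)   # hypothetical 10-armed testbed
h = np.zeros(k)                             # preferences H(a)
baseline = 0.0                              # running average reward (R-bar)
alpha = 0.1                                 # step size (my choice)

for t in range(1, 2001):
    e = np.exp(h - h.max())
    pi = e / e.sum()                        # softmax policy over preferences
    a = rng.choice(k, p=pi)
    r = rng.normal(true_means[a], 1.0)      # noisy reward
    baseline += (r - baseline) / t          # incremental mean of rewards
    one_hot = np.zeros(k)
    one_hot[a] = 1.0
    h += alpha * (r - baseline) * (one_hot - pi)   # gradient bandit update
```

Note that because `(one_hot - pi)` sums to zero, the preferences always sum to zero; only their differences matter.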