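I am new to RL and this is my first question here. I just need some clarification about the expected reward for a state-action-next-state triple. As defined in Sutton and Barto's book, it is:

$$r(s, a, s') = \mathbb{E}[R_t \mid S_{t-1} = s, A_{t-1} = a, S_t = s'] = \sum_{r} r \, \frac{p(s', r \mid s, a)}{p(s' \mid s, a)}$$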
My question is simply: why do we divide by p(s’|s,a)?
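To make the question concrete, here is a minimal numerical sketch of what the formula computes. The joint distribution `joint`, the state names `s1` and `s2`, and the helper `expected_reward` are all made up for illustration and are not from the book; the table stands for p(s', r | s, a) with one (s, a) pair held fixed.

```python
from fractions import Fraction as F

# Hypothetical joint distribution p(s', r | s, a) for one fixed (s, a).
# Keys are (next_state, reward); the probabilities sum to 1.
joint = {
    ("s1", 0): F(1, 10),
    ("s1", 10): F(3, 10),
    ("s2", 0): F(3, 5),
}

def expected_reward(joint, s_next):
    """Expected reward given that the next state is s_next."""
    # p(s' | s, a): marginalize the joint over rewards.
    p_next = sum(p for (sp, _), p in joint.items() if sp == s_next)
    # sum_r r * p(s', r | s, a): reward weighted by the joint probability.
    weighted = sum(r * p for (sp, r), p in joint.items() if sp == s_next)
    # Dividing renormalizes the joint into p(r | s, a, s').
    return weighted / p_next

print(expected_reward(joint, "s1"))  # 15/2
print(expected_reward(joint, "s2"))  # 0
```

In this toy table the undivided sum for `s1` is 3, while the divided value is 15/2, which is the average reward over only those transitions that actually land in `s1`.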
All your explanations are welcome!