Related to: [Difference of Value Functions]
In this problem, how are we going to find the optimal control policy?
The straightforward solution I can think of for a discrete action space is to approximate the value function differences for a given policy
pi, and then apply the policy improvement operator to those learned differences to obtain a new, improved policy.
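
To make the idea concrete, here is a minimal tabular sketch of that loop. It rests on several assumptions that are not part of the linked question: the "difference" is taken to be `D(s, a) = Q^pi(s, a) - V^pi(s)`, the model (`P`, `R`) is known so the differences are computed exactly rather than approximated, and all function names are hypothetical.

```python
import numpy as np

def value_differences(P, R, policy, gamma=0.9):
    """Assumed definition: D(s, a) = Q^pi(s, a) - V^pi(s).

    P: (S, A, S) transition probabilities, R: (S, A) expected rewards,
    policy: (S,) integer action chosen in each state.
    """
    n_states = P.shape[0]
    states = np.arange(n_states)
    P_pi = P[states, policy]   # (S, S) transitions under the policy
    R_pi = R[states, policy]   # (S,) rewards under the policy
    # Solve V = R_pi + gamma * P_pi V exactly (stand-in for a learned estimate).
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
    Q = R + gamma * (P @ V)    # (S, A)
    return Q - V[:, None]

def improve(differences):
    """Greedy policy improvement: pick the action with the largest difference."""
    return np.argmax(differences, axis=1)

def policy_iteration_from_differences(P, R, gamma=0.9, max_iters=100):
    """Alternate difference evaluation and greedy improvement until stable."""
    policy = np.zeros(P.shape[0], dtype=int)
    for _ in range(max_iters):
        new_policy = improve(value_differences(P, R, policy, gamma))
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
    return policy
```

Note that the greedy step only needs the differences, not the absolute values, since subtracting `V^pi(s)` does not change the argmax over actions. With function approximation the exact solve would be replaced by whatever estimator is used for the differences.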
Is there any other possible method for approaching this problem?