Difference of value functions


In this problem, how do we find the optimal control policy?
The straightforward solution I can think of for a discrete action space is to approximate the value function differences for a given policy pi, and then apply the policy improvement operator to those learned differences to obtain a new, improved policy.
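
To make the idea concrete, here is a minimal tabular sketch of what I have in mind. It rests on several assumptions of mine that are not part of the problem statement: the "difference" is taken to be D(s, s') = V(s') - V(s) under the current policy, the transition tensor `P` (shape states x actions x states) and reward matrix `R` (shape states x actions) are known, and exact policy evaluation stands in for whatever method would actually learn the differences. All function names are hypothetical.

```python
import numpy as np

def evaluate_policy(P, R, policy, gamma):
    """Exact V^pi for a tabular MDP: solve (I - gamma * P_pi) V = R_pi."""
    n_states = P.shape[0]
    idx = np.arange(n_states)
    P_pi = P[idx, policy]  # (S, S): transitions under the chosen actions
    R_pi = R[idx, policy]  # (S,): rewards under the chosen actions
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

def improve_from_differences(P, R, D, gamma):
    """Greedy improvement that only sees differences D[s, t] = V(t) - V(s).

    score[s, a] = R[s, a] + gamma * sum_t P[s, a, t] * D[s, t]
                = Q(s, a) - gamma * V(s),
    so the argmax over actions matches the usual greedy step.
    """
    scores = R + gamma * np.einsum("sat,st->sa", P, D)
    return scores.argmax(axis=1)

def policy_iteration_with_differences(P, R, gamma=0.9, max_iters=100):
    policy = np.zeros(P.shape[0], dtype=int)
    for _ in range(max_iters):
        V = evaluate_policy(P, R, policy, gamma)
        # Stand-in for a learned difference estimator (assumption).
        D = V[None, :] - V[:, None]
        new_policy = improve_from_differences(P, R, D, gamma)
        if np.array_equal(new_policy, policy):
            break  # policy is stable under greedy improvement
        policy = new_policy
    return policy
```

Subtracting V(s) shifts every action's score in state s by the same constant, so the greedy action is unchanged. That is why, at least in this tabular setting, the differences alone seem sufficient for the improvement step.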

Is there any other possible method for approaching this problem?