Thank you for your thoughtful explanation. Your code uses two LSTMs to anticipate future states and rewards first, and then guides the agent's action selection. It achieves an impressive result! Thanks!
My method is simple, just an implementation of DQN. I tested CartPole with three variants (deep Q-learning, double Q-learning, and dueling double Q-learning) under the same settings. Below is the per-episode score during training (I selected the run with the most severe dropping):
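For context, the double-DQN target I'm referring to can be sketched like this (a minimal numpy sketch with hypothetical names, not my exact code):

```python
import numpy as np

def double_dqn_targets(q_online_next, q_target_next, rewards, dones, gamma=0.99):
    """Double-DQN bootstrap target for a batch of transitions.

    q_online_next: (batch, n_actions) Q-values of next states from the online net
    q_target_next: (batch, n_actions) Q-values of next states from the target net
    rewards:       (batch,) immediate rewards
    dones:         (batch,) 1.0 if the episode terminated, else 0.0
    """
    # Action selection uses the online network...
    best_actions = np.argmax(q_online_next, axis=1)
    # ...but the value estimate comes from the target network,
    # which reduces the overestimation bias of vanilla DQN.
    next_values = q_target_next[np.arange(len(best_actions)), best_actions]
    return rewards + gamma * (1.0 - dones) * next_values
```

The dueling variant only changes the network architecture (separate value and advantage streams); the target computation stays the same.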
I'm really confused about why there is periodic dropping even after the score has stabilized. In my opinion, once the score has stayed at the maximum for a while, the Q-network should approximate the action-value function well, so I can't understand why the drops occur.
You can find my code here.
Thank you very much!