I’m implementing basic double DQN on CartPole-v0. The network has one hidden layer with 64 neurons, and during training I got the following curve of accumulated reward per episode.

At the beginning, the curve goes up and everything is fine. After the accumulated reward reaches its maximum of 200, it occasionally drops, as can be seen in the figure, and some of the drops are very severe. When I stopped training while the curve was at the top, the resulting policy performed perfectly.

Does anyone know why these drops appear even though the reward is high? Or is it because I have been training for too long?

I think it is ok to stop once your average return over 100 episodes is above 198.
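As a rough illustration of that stopping rule, here is a minimal sketch that tracks the average return over the last 100 episodes and signals when it exceeds 198. The class name `EarlyStopper` and its parameters are hypothetical, not part of any library; the window and threshold are just the numbers mentioned above.

```python
from collections import deque


class EarlyStopper:
    """Hypothetical helper: signals a stop once the moving-average return is high enough."""

    def __init__(self, window=100, threshold=198.0):
        self.window = window
        self.threshold = threshold
        # Keeps only the most recent `window` episode returns.
        self.returns = deque(maxlen=window)

    def update(self, episode_return):
        """Record one episode return; return True when training should stop."""
        self.returns.append(episode_return)
        if len(self.returns) < self.window:
            # Not enough episodes yet to compute a full-window average.
            return False
        return sum(self.returns) / self.window > self.threshold
```

In a training loop you would call `stopper.update(episode_return)` after each episode and break out of the loop (and probably save the network weights) when it returns `True`.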

Thank you for your thoughtful explanation. Your code uses two LSTMs to first predict future states and rewards, which then guide the agent’s choice of action. It achieves an impressive result! Thanks!

My method is simple, just an implementation of DQN. I tested three versions on CartPole under the same settings: deep Q-learning, double Q-learning, and dueling double Q-learning. The per-episode score during training is shown below (I selected the most severe case):

Double DQN
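For reference, the difference between the plain DQN target and the double DQN target can be sketched as below. This is only a NumPy illustration under assumed names (`q_next_online` and `q_next_target` stand for the Q-values that the online and target networks output for the next states); it is not the exact code used for the runs above.

```python
import numpy as np


def dqn_target(rewards, dones, q_next_target, gamma=0.99):
    """Plain DQN: the target network both selects and evaluates the next action."""
    max_q = q_next_target.max(axis=1)
    return rewards + gamma * (1.0 - dones) * max_q


def double_dqn_target(rewards, dones, q_next_online, q_next_target, gamma=0.99):
    """Double DQN: the online network selects the action, the target network evaluates it."""
    best_actions = q_next_online.argmax(axis=1)
    q_eval = q_next_target[np.arange(len(best_actions)), best_actions]
    return rewards + gamma * (1.0 - dones) * q_eval
```

Both functions expect `rewards` and `dones` as float arrays of shape `(batch,)` and the Q-value arrays with shape `(batch, num_actions)`; terminal transitions (`dones == 1`) get no bootstrap term.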

I’m really confused about why there are periodic drops even after the score has been stable. In my opinion, once the score has stayed at the maximum for a while, the Q-network should approximate the action-value function well, so I can’t understand why the drops occur.

