When to stop training on CartPole-v0?


#1

Hi, everyone.

I’m implementing a basic double DQN on CartPole-v0. The network has one hidden layer with 64 neurons, and during training I got the following accumulated-reward-per-episode curve.
[figure: accumulated reward per episode during training]
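
For reference, here is a minimal sketch of the kind of setup described above, assuming PyTorch. The hidden-layer size matches the post, but the names, hyperparameters, and helper function are illustrative, not the original code.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """One hidden layer with 64 neurons, as described in the post."""
    def __init__(self, state_dim=4, n_actions=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, x):
        return self.net(x)

def double_dqn_targets(online_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Double DQN target: the online network selects the next action,
    the target network evaluates it."""
    with torch.no_grad():
        next_actions = online_net(next_states).argmax(dim=1, keepdim=True)
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)
        return rewards + gamma * (1.0 - dones) * next_q
```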

At the beginning the curve goes up and everything is fine. After the accumulated reward reaches its maximum of 200, it occasionally drops, as can be seen in the figure. Some of the drops are very severe. I tried stopping the training process while the curve was at the top, and the resulting policy turned out to be perfect.

Does anyone have an idea why those drops appear even though the reward is already high? Or is it because I have been training for too long?

Thank you.


#2

Great results, Universe.
I think it is OK to stop once your average return over 100 episodes is above 198.
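
For example, that stopping rule could look roughly like the following sketch (variable names and the wrapper function are hypothetical; the 198 threshold is the one suggested above):

```python
from collections import deque

recent_returns = deque(maxlen=100)  # returns of the last 100 episodes

def should_stop(episode_return, threshold=198.0):
    """Stop training once the 100-episode average return reaches the threshold."""
    recent_returns.append(episode_return)
    return (len(recent_returns) == recent_returns.maxlen
            and sum(recent_returns) / len(recent_returns) >= threshold)
```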

I would love to check out your approach. My solution balances the pole forever after 60 episodes; you can check out my version here.
The link to the code is in the description.

I also implemented a modified Actor-Critic agent for LunarLander that displays some interesting behaviours (second link).



#3

Thank you for your thoughtful explanation. Your code uses two LSTMs to predict future states and rewards, which then guide the agent’s action selection. It achieves an impressive result, thanks!
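
For readers who want the general flavour of that idea, here is a purely illustrative sketch (not the referenced code) of an LSTM that predicts the next state and reward from a short history of state-action pairs, assuming PyTorch and CartPole-sized inputs; the class and head names are made up:

```python
import torch
import torch.nn as nn

class DynamicsLSTM(nn.Module):
    """Illustrative only: predicts (next_state, reward) from a history of
    state-action pairs; the referenced solution may differ."""
    def __init__(self, state_dim=4, n_actions=2, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(state_dim + n_actions, hidden, batch_first=True)
        self.next_state_head = nn.Linear(hidden, state_dim)
        self.reward_head = nn.Linear(hidden, 1)

    def forward(self, state_action_seq):
        # state_action_seq: (batch, time, state_dim + n_actions)
        out, _ = self.lstm(state_action_seq)
        last = out[:, -1]  # hidden state at the final time step
        return self.next_state_head(last), self.reward_head(last)
```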

My method is simple, just an implementation of DQN. I tested CartPole with three variants under the same settings: deep Q-learning, double Q-learning, and dueling double Q-learning. The per-episode score during training is shown below (I selected the most severe case):

[figure: Double DQN per-episode score during training]
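
Since the comparison above includes a dueling variant, here is a hedged sketch of what a dueling head typically looks like (my illustration under PyTorch, not necessarily the poster’s code): Q(s, a) is decomposed into a state value and mean-centred advantages.

```python
import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    """Dueling architecture: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""
    def __init__(self, state_dim=4, n_actions=2, hidden=64):
        super().__init__()
        self.feature = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)
        self.advantage = nn.Linear(hidden, n_actions)

    def forward(self, x):
        h = self.feature(x)
        v = self.value(h)        # state value V(s)
        a = self.advantage(h)    # advantages A(s, a)
        return v + a - a.mean(dim=1, keepdim=True)
```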

I’m really confused about why there are periodic drops even after the score has been stable. In my opinion, once the score has stayed at the maximum for a while, the Q-network should approximate the action-value function well, so I can’t understand why the drops occur.

You can find my code here.
Thank you very much!