CartPole-v0 agent gets really good at rolling out of the game?


I made a policy-based agent that plays CartPole-v0.

I thought an agent could easily master the game and keep an episode going indefinitely.

To test that out on mine, I registered a new environment to remove the “winning” limitations:
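
A registration of that kind might look roughly like the sketch below. The id `CartPoleLong-v0`, the step limit, the reward threshold, and both helper names are illustrative assumptions, and it assumes a gym version whose `register()` accepts `max_episode_steps`.

```python
def unlimited_cartpole_spec(max_steps=50_000, reward_threshold=48_000.0):
    """Build keyword arguments for gym's register() (all values are illustrative)."""
    return {
        "id": "CartPoleLong-v0",  # hypothetical id, chosen for this sketch
        "entry_point": "gym.envs.classic_control:CartPoleEnv",
        "max_episode_steps": max_steps,        # raises the time limit
        "reward_threshold": reward_threshold,  # raises the "solved" threshold
    }


def register_unlimited_cartpole(**overrides):
    """Register the long-running CartPole variant and return its id."""
    # Imported lazily so the spec helper above works without gym installed.
    from gym.envs.registration import register

    spec = unlimited_cartpole_spec(**overrides)
    register(**spec)
    return spec["id"]
```

With gym installed, `gym.make(register_unlimited_cartpole())` would then create the environment. Note that this only changes the time limit and threshold; the pole-angle and cart-position termination conditions of `CartPoleEnv` stay as they are.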


My agent is just a simple two-layer dense network that updates after every run using discounted rewards:
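
For reference, the discounted-reward step is commonly computed as in the sketch below. This is a generic illustration, not necessarily the exact code of the agent; the function name `discount_rewards` and the default `gamma` are assumptions (0.99 matches the discount factor mentioned in this post).

```python
def discount_rewards(rewards, gamma=0.99):
    """Return the discounted return G_t = r_t + gamma * G_{t+1} for every step."""
    discounted = [0.0] * len(rewards)
    running = 0.0
    # Walk the episode backwards so each step accumulates its future rewards.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        discounted[t] = running
    return discounted
```

With CartPole's reward of 1 per step, the returns of early steps saturate near `1 / (1 - gamma)` (100 for `gamma = 0.99`) once an episode is long, so most steps of a very long episode receive almost the same weight unless the returns are normalized.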

This agent gets quite good: after about 15 minutes of training, it reaches an average reward of 30000 per episode, using a discount factor of 0.99.

However, after a whole night of training, it only reaches an average reward of about 5700 per episode, and all it does is roll out of the level quickly.


Any idea how I incentivized that behavior?