I made a policy-based agent that plays CartPole-v0.
I thought an agent could easily master the game and have an episode that lasts indefinitely.
To test that out on mine, I registered a new environment to remove the “winning” limitations:
gym.envs.register(
    id='CartPoleMoreSteps-v0',
    entry_point='gym.envs.classic_control:CartPoleEnv',
    max_episode_steps=None,
    reward_threshold=None,
)
My agent is just a simple two-layer dense network that updates after every episode using discounted rewards.
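The network code itself is not shown here. To make "discounted rewards" concrete, here is a minimal sketch of how per-step discounted returns are commonly computed for one episode. The function name `discounted_returns` and its signature are hypothetical, chosen for illustration, and may differ from what the agent actually does.

```python
from typing import List


def discounted_returns(rewards: List[float], gamma: float = 0.99) -> List[float]:
    """Compute G_t = r_t + gamma * G_{t+1} for every step of one episode.

    Hypothetical helper for illustration; not taken from the original agent.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# CartPole gives a reward of +1 for every step the episode continues.
example = discounted_returns([1.0, 1.0, 1.0], gamma=0.99)
print(example)
```

Under this scheme, with a reward of +1 per step and a discount factor of 0.99, the return at any step is bounded above by 1 / (1 - 0.99) = 100 no matter how long the episode runs, so very long episodes look almost identical from the early steps' point of view.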
This agent gets quite good: after about 15 minutes of training, it reaches an average reward of 30000 per episode using a discount factor of 0.99.
However, after a whole night of training, it only averages about 5700 reward per episode, and all it does is get out of the level fast.
Any idea how I incentivized that behavior?