I’m working on an implementation of DQN with the CartPole environment. I quite often see learning curves like the one below - the algorithm can solve the environment (an average reward of 195 over 100 episodes), but as it keeps training it manages to forget all its progress.
Does anyone have any intuition for why this occurs in DQN? I’ve run experiments with different learning rates, target-network update frequencies, and network sizes (to reduce aliasing). Nothing seems to solve the problem.
I’m wondering if the correct thing to do here is to stop training altogether (similar to supervised learning, where you early-stop once the validation loss starts to increase). Any thoughts?
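Concretely, something like the sketch below is what I have in mind - early-stop (or at least checkpoint the weights) once the rolling 100-episode average crosses the solve threshold, so a later collapse can’t erase the progress. Note that `agent`, `env`, and `run_episode` are hypothetical stand-ins for whatever DQN training loop you already have:

```python
from collections import deque

import numpy as np

SOLVE_THRESHOLD = 195  # CartPole "solved" criterion: avg reward over 100 episodes
WINDOW = 100
MAX_EPISODES = 2000

recent_rewards = deque(maxlen=WINDOW)  # rolling window of episode returns
best_avg = -np.inf

for episode in range(MAX_EPISODES):
    # run_episode is a hypothetical helper: plays one episode,
    # does the DQN updates, and returns the total episode reward
    episode_reward = run_episode(agent, env)
    recent_rewards.append(episode_reward)
    avg = np.mean(recent_rewards)

    if len(recent_rewards) == WINDOW:
        # checkpoint the best weights so a later collapse can't erase them
        if avg > best_avg:
            best_avg = avg
            agent.save("best_checkpoint.pt")  # hypothetical save method

        # early stop once the environment counts as solved
        if avg >= SOLVE_THRESHOLD:
            print(f"Solved at episode {episode} (avg {avg:.1f}); stopping.")
            break
```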