Catastrophic forgetting in DQN


#1

I’m working on an implementation of DQN with the Cartpole environment. I quite often see learning curves like the one below - the algorithm can solve the environment (an average reward of 195 over 100 episodes) but as it keeps training it manages to forget all its progress.

Does anyone have any intuition for why this would occur in DQN? I’ve run experiments with different learning rates, target net update frequencies and network sizes (to reduce aliasing). Nothing seems to solve the problem.

I’m wondering if maybe the right thing to do here is to stop training altogether (similar to supervised learning, where you early stop once the validation loss starts to increase). Any thoughts?
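
To be concrete, by “stop training” I mean something like freezing gradient updates once the solved criterion is met. A rough sketch of what I have in mind (`run_episode`, `agent` and `env` are placeholders for my actual loop, not real code I’m running):

```python
from collections import deque

# Freeze gradient updates once the solved criterion is met
# (average reward >= 195 over the last 100 episodes).
recent_rewards = deque(maxlen=100)
solved = False

for episode in range(2000):
    episode_reward = run_episode(env, agent)      # placeholder: collect one episode
    recent_rewards.append(episode_reward)

    if len(recent_rewards) == 100 and sum(recent_rewards) / 100 >= 195:
        solved = True                             # environment counts as solved

    if not solved:
        agent.train_step()                        # only update weights while unsolved
```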


#2

I’ve experienced exactly the same thing with DQN on Cartpole. My guess is that once the agent reaches some stable state, the replay history fills up with examples related only to that stable state, and the network quickly overfits to the new distribution of events.

I did experiment with early stopping and with decaying the gradient as the score went up. It actually helped me reach very high scores, but I don’t think it is a satisfying solution. A big advantage of RL (at least in these toy environments) over supervised learning is that you can generate as many training samples as you like, so you can effectively avoid overfitting as long as you maintain a proper sampling distribution over events.
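
To make “decaying the gradient” concrete, one way to do it is to scale the learning rate down as the rolling average score climbs. The schedule and names below are just one illustrative choice, not what I actually ran:

```python
def lr_for_score(base_lr, avg_score, solve_score=195.0, min_frac=0.01):
    """Linearly decay the learning rate towards a small floor as the
    rolling average score approaches the solve threshold."""
    frac = max(min_frac, 1.0 - avg_score / solve_score)
    return base_lr * frac

# Example: with a PyTorch optimizer you could set this before each update.
# for group in optimizer.param_groups:
#     group["lr"] = lr_for_score(1e-3, avg_score)
```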

I would rather experiment with modifying either the replay history management or the batch sampling. Maybe try a simple heuristic like maintaining an even distribution of scores (numbers of iterations) over the entries in your history. What I mean is: if you annotated every entry in your history with the score of the episode it came from and plotted a histogram of those values, you would want something fairly stable (like a uniform distribution or a normal curve) - see the sketch below.
I assume that if you made such a plot now, it would periodically change its skewness - left-skewed while the agent is learning and right-skewed once it starts achieving high scores.
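
A minimal sketch of that heuristic, assuming each transition is tagged with the score of the episode it came from and batches are drawn by first picking a score bucket uniformly (the class and names are made up for illustration, not something I have tested):

```python
import random
from collections import defaultdict, deque

class ScoreBalancedReplay:
    """Replay buffer that groups transitions by the score of the episode
    they came from and samples each score bucket with equal probability,
    one way to keep the score histogram of training data roughly uniform."""

    def __init__(self, bucket_size=50, max_per_bucket=5000):
        self.bucket_size = bucket_size
        self.buckets = defaultdict(lambda: deque(maxlen=max_per_bucket))

    def add_episode(self, transitions, episode_score):
        # Bucket index, e.g. scores 0-49 -> 0, 50-99 -> 1, ...
        key = int(episode_score) // self.bucket_size
        self.buckets[key].extend(transitions)

    def sample(self, batch_size):
        # Pick a bucket uniformly for every sample, then a transition
        # uniformly within that bucket.
        keys = list(self.buckets.keys())
        return [random.choice(self.buckets[random.choice(keys)])
                for _ in range(batch_size)]
```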

I can never be sure, but this is something I would try myself if I went back to this problem.


#3

I’m glad you had the same experience with instability.

I’ve thought about this a bit, and I think early stopping is the answer here, for two reasons:

1 - In supervised learning, early stopping is common practice, with training stopped once the validation loss starts to rise
2 - when we learn skills, we often stop improving. If I think about my handwriting style, it changed a lot over my first 15 years of life, and very little over the second 15!

So perhaps continuing to force gradient updates onto the agent is a bit cruel!

I look at the distribution of the Bellman target - here is an example with no target preprocessing

The periods where the lower bound of the distribution is higher perform better - I think because in Cartpole both actions should be pretty comparable in terms of expected return, unless you are right on the edge of falling either left or right.
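
If anyone wants to log something similar, the quantity I’m plotting is the standard target r + gamma * max_a Q_target(s', a) per batch. A rough PyTorch-style sketch (placeholder names, not my exact code):

```python
import torch

def bellman_target_stats(rewards, next_states, dones, target_net, gamma=0.99):
    """Compute the Bellman targets for a batch and return summary stats,
    purely for monitoring - no gradient flows through the target net."""
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values     # max_a Q_target(s', a)
        targets = rewards + gamma * (1.0 - dones) * next_q     # zero bootstrap at terminal states
    return {
        "min": targets.min().item(),
        "mean": targets.mean().item(),
        "max": targets.max().item(),
    }
```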

If anyone is interested, I’ve documented my experiments with improving stability here and here.