I’m wondering if fixing an environment’s seed during training is OK or a mistake.
My rationale for using a fixed seed during training comes from supervised learning. I’m specifically implementing DQN, and at its core the algorithm performs supervised learning on minibatches sampled from the replay memory.
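To make the analogy concrete, here is a minimal sketch (my own toy example, not from any DQN implementation) of the “supervised” step: regression targets are built from sampled transitions via the Bellman equation, and the online network is fit to them like labels. The linear Q-function and all the numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minibatch sampled from a replay memory:
# 4 transitions (state, action, reward, next_state, done),
# with 3-dimensional states and 2 discrete actions.
states = rng.normal(size=(4, 3))
actions = np.array([0, 1, 1, 0])
rewards = np.array([1.0, 0.0, 0.5, 1.0])
next_states = rng.normal(size=(4, 3))
dones = np.array([0, 0, 1, 0])

gamma = 0.99

def q_values(s, w):
    """Toy linear Q-network: a weight matrix mapping states to action values."""
    return s @ w

w = rng.normal(size=(3, 2))   # online-network weights
w_target = w.copy()           # frozen target-network weights

# The supervised-learning part: fixed regression targets
# y = r + gamma * max_a' Q_target(s', a'), with bootstrapping
# cut off at terminal transitions (done = 1).
targets = rewards + gamma * (1 - dones) * q_values(next_states, w_target).max(axis=1)

# Squared error on the Q-values of the actions actually taken.
predictions = q_values(states, w)[np.arange(4), actions]
loss = np.mean((predictions - targets) ** 2)
```

The replay memory plays the role of the training set here; each sampled minibatch is just (input, label) pairs for a regression problem.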
In supervised learning we reuse the training set all the time; we just make sure we don’t overfit by checking against a validation set and eventually a test set.
In the Gym context, I see resetting the seed (i.e. removing the fixed seed) before testing as equivalent to always having a fresh validation or test set. So I think this is a valid check that I did not overfit during training.
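The train/test split I have in mind can be sketched as below. I use a toy stand-in class rather than a real Gym environment; in recent Gym/Gymnasium versions `env.reset(seed=...)` plays this role (older Gym releases used `env.seed()` instead).

```python
import random

class NoisyEnv:
    """Toy stand-in for a Gym environment whose episodes depend on a seed."""
    def __init__(self):
        self.rng = random.Random()

    def reset(self, seed=None):
        # Mirrors the Gymnasium-style reset(seed=...) convention.
        if seed is not None:
            self.rng.seed(seed)
        return self.rng.random()  # initial observation

env = NoisyEnv()

# Training: re-seeding with a fixed value makes every episode
# start identically, so runs are reproducible.
train_obs = [env.reset(seed=42) for _ in range(3)]

# Testing: no seed, so each episode starts from a fresh draw --
# the analogue of evaluating on a held-out validation/test set.
test_obs = [env.reset() for _ in range(3)]
```

If the agent only solves the seeded episodes but not the unseeded ones, that would be the RL analogue of overfitting the training set.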
Further, the reason I want to fix the seed during training is so that people can reproduce the results, and so that I can fairly evaluate hyperparameters such as the learning rate or exploration-vs-exploitation schedules.
On the other hand, my argument against fixing the seed is that Q-learning should theoretically learn the optimal policy even when not acting optimally (it is off-policy), which seems to imply that I should be able to train DQN satisfactorily (as in, achieve ‘solved’ scores during testing) regardless of random weight initialization and environment stochasticity.