Gym environments: Does it make sense to train with a fixed seed?


#1

I’m wondering if fixing an environment’s seed during training is OK or a mistake.

My rationale for using a fixed seed during training comes from supervised learning in general. I’m specifically implementing DQN, and at its core the algorithm performs supervised learning on minibatches sampled from the replay memory.
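To make that analogy concrete, here is a minimal sketch of how one DQN update builds regression targets from a replay minibatch. It's plain NumPy, and `target_net`, `dqn_targets`, and `batch` are purely illustrative names, not from any particular codebase:

```python
import numpy as np

def dqn_targets(batch, target_net, gamma=0.99):
    """Bellman targets y = r + gamma * max_a' Q_target(s', a'), zeroed on terminal steps."""
    states, actions, rewards, next_states, dones = batch  # arrays sampled from replay memory
    next_q = target_net(next_states).max(axis=1)          # max over actions, shape (batch,)
    return rewards + gamma * (1.0 - dones) * next_q

# The "supervised" step is then ordinary regression: fit Q(s, a) toward the
# fixed targets y, just like fitting (input, label) pairs from a training set.
```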

In supervised learning we reuse the training set all the time; we just make sure we don’t overfit by checking against a validation set and eventually a test set.

In the Gym context, I see resetting the seed (i.e. removing the fixed seed) before testing as equivalent to always having a fresh validation or test set, so I think this is a valid check that I did not overfit during training.

Further, the reason I want to fix the seed during training is so that people can reproduce the results, and so that I can compare hyperparameters such as the learning rate or the exploration-vs-exploitation schedule on equal footing.
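Here is roughly what I have in mind, as a sketch only. The environment name and seed value are arbitrary, and the exact seeding call depends on your Gym version (older Gym uses `env.seed(s)` before `reset()`, newer Gym/Gymnasium passes the seed to `reset()`):

```python
import gym

SEED = 42  # arbitrary fixed training seed

# Training environment: fixed seed so runs are reproducible and
# hyperparameter comparisons start from the same initial conditions.
train_env = gym.make("CartPole-v1")
obs, info = train_env.reset(seed=SEED)   # newer Gym/Gymnasium API
train_env.action_space.seed(SEED)        # seeds random action sampling too

# ... training loop ...

# Test environment: no fixed seed, so every evaluation episode starts from
# fresh random initial conditions, playing the role of a validation/test set.
eval_env = gym.make("CartPole-v1")
obs, info = eval_env.reset()
```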

On the other hand, my argument against fixing the seed is that Q-learning should theoretically learn the optimal policy even when not acting optimally, which seems to imply that I should be able to train DQN satisfactorily (as in, achieve ‘solved’ scores during testing) regardless of random weight initialization and environment stochasticity.


#2

It is worth reading this article about successes and failures of RL: https://www.alexirpan.com/2018/02/14/rl-hard.html

In short, because deep RL is often unreliable, fixing a single seed for training carries the risk that you never see your algorithm succeed, even if it is otherwise a reasonable one. If you want reproducible results when assessing your algorithm’s training process, I would suggest using a small fixed set of seeds instead.

You would also get a rough sense of how well the training algorithm performs under different starting conditions: if it succeeds every time, that’s a good sign, whereas if it only succeeds on the one seed you decided to use, you cannot be sure the training procedure is general enough, even if the resulting trained policy is.
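As a rough sketch of what that could look like (the environment name is arbitrary, and `train_dqn` / `evaluate` are placeholders for your own training and evaluation code, not real library calls):

```python
import random
import gym
import numpy as np

SEEDS = [0, 1, 2, 3, 4]  # small fixed set of seeds: reproducible, but not a single roll of the dice
returns = []

for seed in SEEDS:
    # Seed every source of randomness you control; also seed your DL
    # framework here (e.g. torch.manual_seed(seed)) if you use one.
    random.seed(seed)
    np.random.seed(seed)

    env = gym.make("CartPole-v1")
    env.reset(seed=seed)          # newer Gym/Gymnasium API; older Gym uses env.seed(seed)
    env.action_space.seed(seed)

    policy = train_dqn(env, seed=seed)   # placeholder for your training code
    returns.append(evaluate(policy))     # evaluate on an unseeded env for fresh test conditions

# Succeeding on every seed is much stronger evidence than succeeding on one.
print(f"mean return {np.mean(returns):.1f} +/- {np.std(returns):.1f} over {len(SEEDS)} seeds")
```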