I’m confused about the way to evaluate whether an agent solves an environment or not, for environments that show a definition for ‘solved’ in the wiki.
For example, take LunarLander-v2 (discrete). The wiki states the solved condition as:
LunarLander-v2 defines “solving” as getting average reward of 200 over 100 consecutive trials.
It is not clear to me whether "consecutive trials" includes training episodes too, or strictly testing episodes AFTER I train my agent for however many zillion episodes I want.
From ML in general, I'd assume that first I need to train for however many episodes or time steps I choose, and only AFTER training would I test whether I solved the problem, by executing the learnt policy (no exploration, no updates) for as many episodes as needed until I get an average of 200 over 100 consecutive testing episodes.
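To make concrete what I mean by "testing": a minimal evaluation loop like the sketch below, where the policy is fixed and nothing is learnt. (`_OneStepEnv`-style environments, `make_env`, and `dqn_greedy_policy` are placeholders for your own setup, not real Gym API; the `env.reset()`/`env.step()` calls follow the classic Gym interface.)

```python
def evaluate(env, policy, n_episodes):
    """Execute a fixed policy with no exploration and no learning
    updates; return the list of per-episode returns."""
    returns = []
    for _ in range(n_episodes):
        obs = env.reset()
        done, total = False, 0.0
        while not done:
            action = policy(obs)                    # greedy action only
            obs, reward, done, _ = env.step(action)
            total += reward
        returns.append(total)
    return returns

# Hypothetical usage with a trained agent:
# returns = evaluate(make_env("LunarLander-v2"), dqn_greedy_policy, 100)
# solved = sum(returns) / len(returns) >= 200
```

Under this reading, "solved" would mean the mean of those 100 evaluation returns clears 200.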
For example, I tried training a DQN on LunarLander-v2 for 1200 episodes. At around 1000 episodes of training, the average reward over the last 100 training episodes reached 200. But that should not constitute solving the problem; I was still training (and still exploring).
After training for 1200 episodes, I evaluated both the fully trained model (1200 episodes) and the intermediate checkpoint from when the training average first hit 200 (approx. 1000 episodes). I evaluated them by simply executing the learnt policies against the environment, with no further updates to the underlying networks.
Neither model achieved an average of 200 over 100 consecutive episodes in a 300-episode test.
My conclusion is that either I should have tested longer, or I did not actually solve the problem even though during training it seemed I might have.
Another thing throwing me off: in CartPole-v0, for example, the leaderboard shows entries where people theoretically solved the problem in fewer than 100 episodes. I don't see how this is possible if the definition of "solved" requires an average over 100 consecutive episodes, which means at least 100 episodes are needed before the problem can be declared solved.
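My guess (and it is only a guess about the leaderboard convention) is that the reported number counts the episodes *before* the 100-episode averaging window, so an entry below 100, or even 0, is possible if the very first 100 episodes already average above the threshold (195 for CartPole-v0). A sketch of that interpretation:

```python
def episodes_to_solve(returns, threshold, window=100):
    """First index e such that the mean of returns[e : e+window] clears
    the threshold. If the leaderboard counts only the e episodes before
    the averaging window, values below 100 (even 0) are possible."""
    for e in range(len(returns) - window + 1):
        if sum(returns[e:e + window]) / window >= threshold:
            return e
    return None  # never solved within the recorded episodes
```

For instance, an agent whose first 100 episodes all score 200 would be reported as solving in 0 episodes under this convention, even though 100 episodes were actually run.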