Gym environments: should I evaluate solve condition during or after training?


#1

I’m confused about how to evaluate whether an agent solves an environment or not, for environments that have a definition of ‘solved’ in the wiki.

For example, take LunarLander-v2 (discrete). Its solved condition reads:

LunarLander-v2 defines “solving” as getting average reward of 200 over 100 consecutive trials.

It is not clear to me whether these consecutive trials include training episodes too, or strictly testing episodes AFTER I train my agent for however many zillion episodes I want.

From ML in general, I’d assume that, first I need to train for however many episodes or time steps I choose to. And only AFTER training could I test if I solved the problem by executing the learnt policy for as many episodes as I choose to until I might get an average of 200 over 100 consecutive testing episodes.
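
To make that concrete, the test procedure I have in mind is roughly this (just a sketch using the classic gym API where `reset()` returns an observation and `step()` returns `(obs, reward, done, info)`; `agent.act` is a placeholder for whatever greedy action my trained model produces):

```python
import gym
import numpy as np

env = gym.make("LunarLander-v2")
test_returns = []

for episode in range(100):              # 100 consecutive testing episodes, no learning
    obs = env.reset()
    done = False
    episode_return = 0.0
    while not done:
        action = agent.act(obs)         # placeholder: greedy action from the trained policy
        obs, reward, done, info = env.step(action)
        episode_return += reward
    test_returns.append(episode_return)

print("average over 100 test episodes:", np.mean(test_returns))
print("solved:", np.mean(test_returns) >= 200.0)
```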

For example, I tried training a DQN on LunarLander for 1200 episodes. At around 1000 episodes worth of training, I got an average reward of 200 over the last 100 training episodes. But that should not constitute solving the problem; I am still training.
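
(For reference, the training average I mention here was tracked roughly like this; `train_one_episode` is a placeholder for my DQN’s per-episode loop of exploration plus learning updates, returning the total episode reward.)

```python
from collections import deque
import numpy as np

recent_rewards = deque(maxlen=100)        # total rewards of the last 100 TRAINING episodes

for episode in range(1200):
    episode_reward = train_one_episode()  # placeholder: one episode of exploration + DQN updates
    recent_rewards.append(episode_reward)

    if len(recent_rewards) == 100 and np.mean(recent_rewards) >= 200.0:
        # this fired around episode 1000 for me; I saved the intermediate checkpoint here
        print(f"training average of last 100 episodes reached 200 at episode {episode}")
```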

After training for 1200 episodes, I tried both the fully trained model (i.e. 1200 episodes) and the intermediate checkpoint from when I achieved the training average of 200 (approx. 1000 episodes). I tried these by just testing the learnt policies against the environment, without further modifications to the underlying networks.

Neither model achieved an average of 200 over 100 episodes in a 300-episode test.

My conclusion is that either I should have tested longer or I did not solve the problem, even though in training it seemed I might have.

Another thing throwing me off is that, for example, in CartPole-v0 the leaderboard shows entries where theoretically people solved the problem in fewer than 100 episodes. I don’t see how this is possible if the definition of solved requires an average over 100 episodes, which means at least 100 episodes are needed to declare the problem solved.


#2

It should just be a pure test process, without training, for checking the “solved” goal of Gym environments.

You would not plot any test episodes on a training graph, or count them towards the episode count where the agent is learning.

From ML in general, I’d assume that, first I need to train for however many episodes or time steps I choose to. And only AFTER training could I test if I solved the problem by executing the learnt policy for as many episodes as I choose to until I might get an average of 200 over 100 consecutive testing episodes.

I am fairly sure this is correct.

In some RL scenarios (most notably for bandit problems), we do care about reward gained whilst exploring and training. However, with the gym environments, the “solved” criterion can be applied to any agent.
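
For example, the check can be written as a function that takes any policy at all, learned or not (a sketch, again assuming the classic gym API; the `policy` argument is any callable from observation to action):

```python
import gym
import numpy as np

def meets_solved_criterion(policy, env_id="LunarLander-v2", episodes=100, target=200.0):
    """Run a fixed number of pure test episodes (no learning) and compare the mean return to the target."""
    env = gym.make(env_id)
    returns = []
    for _ in range(episodes):
        obs = env.reset()
        done = False
        total = 0.0
        while not done:
            obs, reward, done, _ = env.step(policy(obs))
            total += reward
        returns.append(total)
    return np.mean(returns) >= target

# works equally for a trained DQN, a hand-written controller,
# or a planner that uses the environment model
```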

Another thing throwing me off is that, for example, in CartPole-v0 the leaderboard shows entries where theoretically people solved the problem in fewer than 100 episodes. I don’t see how this is possible if the definition of solved requires an average over 100 episodes, which means at least 100 episodes are needed to declare the problem solved.

The episodes used to test whether the goal criteria are met do not count towards the training episode count. There were no rules about how the agents were coded, or what approach they used, so it is even possible for an agent built around planning and access to the environment model to solve an environment without any training at all.

For fair comparison, you would want to see the implementations at the top of the leaderboard in case they started with some key assumptions or prior knowledge that you are not using.


#3

Could you clarify this a bit more?

For example, assume CartPole, where solved means an average of 195 over 100 consecutive episodes. And assume I will test for a maximum of 1000 episodes (once I am done training).

Do you mean that these 2 scenarios are equivalent for the purposes of claiming it is solved in 100 episodes?

A- I get a reward of 195 or more on testing episodes 1 thru 100.
B- I get a reward of 195 or more on testing episodes 100 thru 200.

Or do you mean something else? Because regardless of the coding approach (i.e. planning as opposed to learning), I don’t see how someone can say they solved it in less than 100 episodes, if the condition for solving states it is an average.

For example, if after training I consistently get 195 or more since the first testing episode (like in scenario A above), should I then consider it solved in one testing episode? (since all episodes will score at least 195).

Another doubt I have is, what happens if I get an average of 195 on 100 consecutive testing episodes but then get a ‘bad batch’ of states and my reward drops below the 195 average (and potentially goes back up again at some point). This could happen if the environment is noisy enough. Would we still consider it solved because ‘at some point during testing’ I did get a 195 average over 100 consecutive episodes?

Thanks!


#4

For example, assume CartPole, where solved means an average of 195 over 100 consecutive episodes.

OK

And assume I will test for a maximum of 1000 episodes (once I am done training).

You should really test for 100 episodes; that is the criterion. If you test for more episodes, you should take the average over all of them. Do not select the best 100 consecutive episodes, because that will over-estimate your agent’s score.
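
In other words, given the per-episode returns from whatever test length you fixed in advance, the reported score is just the plain mean (sketch):

```python
import numpy as np

test_returns = np.asarray(test_returns)   # returns from the test episodes you decided on in advance

score = test_returns.mean()               # report this: the mean over ALL test episodes
solved = score >= 195.0

# do NOT do this -- picking the best 100-episode window over-estimates the agent:
# best_window = max(test_returns[i:i + 100].mean() for i in range(len(test_returns) - 99))
```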

Do you mean that these 2 scenarios are equivalent for the purposes of claiming it is solved in 100 episodes?
A- I get a reward of 195 or more on testing episodes 1 thru 100.
B- I get a reward of 195 or more on testing episodes 100 thru 200.
Or do you mean something else?

I mean something else. You either train, in which case you count training episodes. Or you test, in which case you decide in advance how many episodes to test for.

In supervised learning, you might plot training epochs. You may cross-validate or test your solution at any point. You don’t count the data used to cross-validate as part of training, and you don’t plot it on learning curves. It is the same thing here. Episodes used purely to evaluate an agent are not part of training, and you don’t plot or report them on learning curves. You note them when describing how you measured results.
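
Concretely, you can keep two separate logs, one for the learning curve and one for evaluation results (a sketch; `train_one_episode`, `run_greedy_episode`, `num_training_episodes` and `eval_every` are all placeholders for your own training step, a no-learning rollout, and whatever schedule you pick):

```python
train_rewards = []    # goes on the learning curve, counts towards "episodes of training"
eval_results = []     # reported separately, never plotted on the learning curve

for episode in range(num_training_episodes):
    train_rewards.append(train_one_episode())          # exploration + learning updates

    if episode % eval_every == 0:
        # pure evaluation: greedy policy, no updates, these episodes are not counted as training
        scores = [run_greedy_episode() for _ in range(100)]
        eval_results.append((episode, sum(scores) / len(scores)))
```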

Because regardless of the coding approach (i.e. planning as opposed to learning), I don’t see how someone can say they solved it in less than 100 episodes, if the condition for solving states it is an average.

It’s an average taken in a test scenario. The number of episodes to train is reported from training scenarios. You don’t count them in the same sequence.

For example, if after training I consistently get 195 or more since the first testing episode (like in scenario A above), should I then consider it solved in one testing episode? (since all episodes will score at least 195).

This should not happen. You always use the same number of episodes during testing, e.g. 100 episodes with CartPole, if you are following the Gym guidelines.

Another doubt I have is, what happens if I get an average of 195 on 100 consecutive testing episodes but then get a ‘bad batch’ of states and my reward drops below the 195 average (and potentially goes back up again at some point). This could happen if the environment is noisy enough.

You test with 100 episodes to get a pass/fail. Yes, this can be a noisy result, the same as in supervised learning.

Would we still consider it solved because ‘at some point during testing’ I did get a 195 average over 100 consecutive episodes?

You should not test in this way. Test with 100 episodes if you want to repeat results as shown in Gym. If you want to test with >100 episodes, then take an average over all of them. Decide in advance how many you want to test with, and stick with that. Do not repeat tests until you get a satisfactory result, because then your test will be biased. The same applies to looking for consecutive runs of good or bad results inside a larger test set.
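
To see why selecting a good window within a larger test is biased, here is a quick simulation with made-up returns (not from any real agent): a noisy policy whose true mean is below 195 can still show at least one 100-episode window above it.

```python
import numpy as np

rng = np.random.default_rng(0)
returns = rng.normal(loc=192.0, scale=40.0, size=1000)  # fake test returns, true mean 192: NOT solved

overall_mean = returns.mean()                           # the fair statistic, close to 192
window_means = np.convolve(returns, np.ones(100) / 100, mode="valid")  # every 100-episode window

print(f"overall mean:     {overall_mean:.1f}")
print(f"best window mean: {window_means.max():.1f}")    # noticeably higher, can cross 195 purely by chance
```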

Yes, that means you can get false positives and false negatives. You can treat the “solved” target in some ways as guidance from the creators about the level of consistency they expect, in order to reduce errors in testing.

If you care about being more accurate, test for more episodes and take an average over them. The target for total reward should be the same.


#5

Ok, there’s still something not clear to me.

When cartpole states:

CartPole-v0 defines “solving” as getting average reward of 195.0 over 100 consecutive trials.

  1. Do we agree this means:
    1.a first train for however many episodes you want.
    1.b then test for exactly 100 episodes. If you get an average over those 100 episodes of 195 or more, you solved it.

  2. If so, when the leaderboard reports first place as taking 16 ‘Episodes before solve’, what do they mean? If these were test episodes, 16 would not be valid, because 16 episodes are not enough to report the ‘winning average’, which is instead over 100 episodes. So it has to mean something else.