Many users were "cheating"


#1

When I was looking at some of the submitted results for Gym environments, I found that many entries with extremely “good training performance” do not even have a “learning-curve”-like learning curve. From my personal point of view, they must have cheated.

This one is clearly an example of cheating:
https://gym.openai.com/evaluations/eval_NZKl9J8wTHC3VNQNREUt2Q#comments

He trained the model beforehand, then used the already-trained model to record the “learning curve”.
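Roughly speaking, the pattern amounts to something like this (a minimal sketch assuming the old gym.wrappers.Monitor API; the random-search linear policy is just a stand-in for whatever model was actually used):

```python
import gym
import numpy as np

env = gym.make("CartPole-v0")

def run_episode(env, weights):
    """Play one episode with a fixed linear policy and return the total reward."""
    obs = env.reset()
    total_reward = 0.0
    for _ in range(200):
        action = 0 if np.dot(weights, obs) < 0 else 1
        obs, reward, done, _ = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward

# Phase 1: "train" (here just random search over linear policies) with NO
# monitor attached, so none of these episodes appear in the uploaded results.
best_weights, best_reward = None, float("-inf")
for _ in range(1000):
    weights = np.random.uniform(-1.0, 1.0, size=4)
    episode_reward = run_episode(env, weights)
    if episode_reward > best_reward:
        best_weights, best_reward = weights, episode_reward

# Phase 2: only now wrap the environment and start recording. Every recorded
# episode is played by the already-tuned policy, so the uploaded "learning
# curve" starts out near-perfect instead of showing any learning.
env = gym.wrappers.Monitor(env, "/tmp/cartpole-results", force=True)
for _ in range(100):
    run_episode(env, best_weights)
env.close()
```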

Even with a “learning-curve”-like learning curve, there is no guarantee that the algorithm is good unless the user provides the raw code. One can easily manipulate the code to get a good learning curve out of a bad algorithm, as long as the model can eventually be trained to do things well.

We certainly welcome good training algorithms. However, the cheated results from these people are very misleading. Everyone’s time is precious. I wish there were some rules to prevent this.


#2

Not necessarily. Some folks may have run training once, saved the weights, and then tested them afterwards to see how they perform. This is not that uncommon: you want to see how your “learned” agent runs without any further training period.

We had an assignment recently in which we did something similar: we had to solve a problem, then take the trained agent, run it again in a “test” scenario, and see how it did (did it solve it immediately or take some time? Mine took a handful of iterations before getting there, but a better-weighted network might show better results).


#3

This definitely appears to be a case of misleading representation of results. The algorithm trains until it reaches the required average reward, then immediately breaks out of training, creates the observation wrapper, and begins recording the actual attempt. This doesn’t demonstrate the performance of the model over time at all.

I’ve noticed a similar thing with some other submissions, for example: https://gym.openai.com/evaluations/eval_87oe5SWiSquWCzOYL35VPg
This attempt gathers 1000 samples with random actions before the wrapper is initialised. That should count as cheating too.
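For reference, the pattern is roughly this (a sketch; the buffer handling is my own placeholder, but the point is that the Monitor only ever sees what happens after the unrecorded warm-up):

```python
import gym

env = gym.make("CartPole-v0")

# Warm-up: 1000 transitions with random actions, collected BEFORE the
# Monitor wrapper exists, so none of this exploration is recorded.
warmup_samples = []
obs = env.reset()
for _ in range(1000):
    action = env.action_space.sample()
    next_obs, reward, done, _ = env.step(action)
    warmup_samples.append((obs, action, reward, next_obs, done))
    obs = env.reset() if done else next_obs

# Only now is the wrapper created, so the recorded run begins with an
# agent that has already been seeded with 1000 unrecorded transitions.
env = gym.wrappers.Monitor(env, "/tmp/results", force=True)
# ... learning and recording proceed from here
```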


#4

EDIT: Didn’t realize I had already replied to this post, but oh well, I’ll leave this here, and hopefully not have any reason to reply again.

There is a difference between testing a trained model and presenting the results of training a model. I trained a model until it achieved the goal of 200 and then re-ran it with the trained weights (as per a class assignment), and it didn’t get there that quickly (it still took multiple hundred runs). No, this isn’t cheating; this is the difference between training a model and testing a model.

As far as cheating by randomly selecting 1000 samples goes, I would argue that the link you provided is simply another way to solve the problem. Instead of acting epsilon-greedy, they chose random actions for a while to explore the state space before actually doing any learning; that’s an exploration-vs-exploitation choice made by the algorithm designer. But what really counts as cheating in this context? This isn’t truly a contest, it is simply a metric we can choose to aim for. There is no money or fame as far as I’m aware of, and even if there were, the algorithm would be scrutinized to decide whether it was worthy of fame or not :slight_smile:
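To make the distinction concrete, here is a rough sketch of the two exploration choices (the names and the q_values structure are placeholders, not the code from that submission):

```python
import random

def act_epsilon_greedy(q_values, action_space, epsilon):
    """Classic epsilon-greedy: keep exploring a little on every step."""
    if random.random() < epsilon:
        return action_space.sample()
    return max(range(action_space.n), key=lambda a: q_values[a])

def act_with_warmup(step, warmup_steps, q_values, action_space):
    """Pure random exploration for a fixed warm-up period, then act greedily."""
    if step < warmup_steps:
        return action_space.sample()
    return max(range(action_space.n), key=lambda a: q_values[a])
```

Both are legitimate exploration strategies; the question raised in this thread is only whether that warm-up happens before or after the Monitor starts recording.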