[Gym | MountainCar-v0] Reward value



I’m running my first Gym environment (MountainCar-v0) with a sequence of random actions. At every step I get a reward of -1.0. It is always the same value, even when the random actions are “lucky” enough to make the car climb the hill.
I was expecting something like a higher value when the car gets closer to the top, or moves faster, or maybe a function of the two… Is there some documentation for each environment explaining how to interpret the observation and reward values?
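For reference, this is roughly what I’m seeing. The sketch below re-implements the step dynamics by hand (constants copied from gym’s `mountain_car.py`, so treat them as assumptions rather than the gym API itself), and shows the reward is -1.0 no matter what the car does:

```python
import math
import random

# Hand-copied sketch of the MountainCar-v0 dynamics (based on
# gym/envs/classic_control/mountain_car.py) showing that the reward
# is -1.0 on every step, regardless of position or velocity.
MIN_POS, MAX_POS = -1.2, 0.6
MAX_SPEED = 0.07
GOAL_POS = 0.5
FORCE, GRAVITY = 0.001, 0.0025

def step(position, velocity, action):
    """action is 0 (push left), 1 (no push) or 2 (push right)."""
    velocity += (action - 1) * FORCE + math.cos(3 * position) * (-GRAVITY)
    velocity = max(-MAX_SPEED, min(MAX_SPEED, velocity))
    position = max(MIN_POS, min(MAX_POS, position + velocity))
    if position == MIN_POS and velocity < 0:   # hit the left wall
        velocity = 0.0
    done = position >= GOAL_POS
    reward = -1.0                              # constant, every step
    return position, velocity, reward, done

random.seed(0)
position, velocity = random.uniform(-0.6, -0.4), 0.0
rewards = []
for _ in range(200):
    position, velocity, reward, done = step(position, velocity,
                                            random.randrange(3))
    rewards.append(reward)
    if done:
        break

print(set(rewards))   # every reward is -1.0
```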

Thank you!


MountainCar is an exploration environment: the main challenge is to have an algorithm that can explore the state space until it finds the goal. The reward is -1 at every step, so from the agent’s point of view there is no difference between actions until it reaches the terminal position, at which point it stops receiving the negative reward (which is a good thing).

Part of the design of this environment was to challenge algorithms that use random exploration, as this (along with the slopes of the hills) makes it very unlikely for a random sequence of actions to reach the goal state. I’m not sure of the exact odds, but I wouldn’t be surprised if they were less than 1 in 100 million.

One way around this is to give the agent an internal bias towards less-visited states, all other things being equal, so that it prefers states on the edge of what it has already seen. Another is to divide up the state space and use an evolutionary algorithm to determine the policy, although that won’t scale well to problems with larger action/state spaces. In an environment like this it essentially moves the randomness from individual actions to groups of actions over similar states, with the number of tiles to divide the state space into as an additional parameter (an infinite number of tiles essentially takes you back to the original problem).
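The first idea (a bias towards less-visited states) can be sketched as a count-based novelty bonus. Everything here is an assumption for illustration, not gym API: discretise the 2-D observation (position, velocity) into tiles, and add a bonus to the environment reward that shrinks each time a tile is revisited:

```python
import math
from collections import defaultdict

# Count-based exploration bonus (illustrative sketch, not gym API):
# discretise (position, velocity) into tiles and reward novelty with
# a bonus proportional to 1/sqrt(visit count).
N_TILES = 20
POS_RANGE = (-1.2, 0.6)
VEL_RANGE = (-0.07, 0.07)
visit_counts = defaultdict(int)

def tile(obs):
    """Map a continuous (position, velocity) pair to a discrete tile index."""
    pos, vel = obs
    i = min(N_TILES - 1,
            int((pos - POS_RANGE[0]) / (POS_RANGE[1] - POS_RANGE[0]) * N_TILES))
    j = min(N_TILES - 1,
            int((vel - VEL_RANGE[0]) / (VEL_RANGE[1] - VEL_RANGE[0]) * N_TILES))
    return i, j

def shaped_reward(obs, env_reward, beta=0.1):
    """Environment reward plus a novelty bonus that decays with revisits."""
    visit_counts[tile(obs)] += 1
    return env_reward + beta / math.sqrt(visit_counts[tile(obs)])

first = shaped_reward((-0.5, 0.0), -1.0)   # novel tile: largest bonus
later = shaped_reward((-0.5, 0.0), -1.0)   # same tile again: smaller bonus
print(first, later)
```

With this shaping, the agent sees different values for otherwise identical -1 rewards, so “go somewhere new” becomes directly optimisable.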


Thank you for the detailed explanation. I think I will first try with a genetic algorithm and see how it goes.


How do you plan to use a genetic algorithm? How do you evaluate the fitness when you get no rewards in this environment?


What if the car goes out on the left?
The environment gives the done signal there too, and the agent stops receiving the negative reward as well!
So how can the agent tell the difference, so it doesn’t learn to make the car climb out on the left?


Because the challenge in MountainCar is exploration, and most trajectories sum to the same total reward, optimisation with a genetic algorithm should actually be a good fit: it encourages a large variational search over state-action space, and at least one member of the population will likely end up reaching the goal. It’s probably not the best strategy once you have a general idea of where you are heading, but starting off by searching 200 or so completely different strategies (drawn from a distribution) will probably do okay.
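A minimal sketch of that idea, under some loud assumptions: the policy is a lookup table over position tiles only, the dynamics are hand-copied from gym’s `mountain_car.py` rather than using the gym API, and (since the per-step reward carries no signal) fitness is taken to be the highest position reached:

```python
import math
import random

# Evolutionary search over tile-based policies (illustrative sketch).
# Fitness = highest position reached, because every step's reward is -1.
FORCE, GRAVITY = 0.001, 0.0025
N_TILES = 10  # tiles over position only, for brevity

def rollout(policy, steps=200):
    """Run one episode; return the highest position reached."""
    pos, vel = -0.5, 0.0
    best = pos
    for _ in range(steps):
        t = min(N_TILES - 1, int((pos + 1.2) / 1.8 * N_TILES))
        action = policy[t]                       # 0 = left, 1 = none, 2 = right
        vel += (action - 1) * FORCE + math.cos(3 * pos) * (-GRAVITY)
        vel = max(-0.07, min(0.07, vel))
        pos = max(-1.2, min(0.6, pos + vel))
        if pos == -1.2 and vel < 0:              # left wall: velocity zeroed
            vel = 0.0
        best = max(best, pos)
        if pos >= 0.5:                           # goal reached
            break
    return best

random.seed(1)
population = [[random.randrange(3) for _ in range(N_TILES)] for _ in range(50)]
for _ in range(20):                              # generations
    population.sort(key=rollout, reverse=True)
    parents = population[:10]
    # Keep the parents and refill the population with mutated copies.
    population = parents + [
        [a if random.random() < 0.9 else random.randrange(3) for a in p]
        for p in parents for _ in range(4)
    ]
best = max(rollout(p) for p in population)
print(best)
```

The selection pressure here comes entirely from the fitness proxy, which is the point of the earlier question: you have to invent a signal (distance travelled, steps survived after reaching the goal, etc.) because the reward itself is uninformative.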


There is a min_position value set in the environment, so if you hit the left wall your velocity is set to 0, meaning you can only move away from the wall again: https://github.com/openai/gym/blob/master/gym/envs/classic_control/mountain_car.py#L46
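In other words, the left side never triggers `done`. The check at the linked line behaves roughly like this (a hand-written sketch of that one clamp, not the gym code itself):

```python
# Sketch of the left-wall clamp in gym's mountain_car.py: when the
# position is clipped to min_position, any leftward velocity is zeroed,
# so the car can never leave through the left side of the valley.
min_position = -1.2

def clip_left_wall(position, velocity):
    position = max(min_position, position)
    if position == min_position and velocity < 0:
        velocity = 0.0
    return position, velocity

print(clip_left_wall(-1.3, -0.05))  # car pinned at the wall, velocity zeroed
```

So the agent never has to distinguish the two exits: only the right side (position >= 0.5) ends the episode.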