MountainCar is an exploration environment: the main challenge is exploring the state space until the goal is found. The reward is -1 at every step, so from the agent's point of view there is no signal distinguishing one action from another until it reaches the terminal position, at which point it stops accumulating negative reward (which is a good thing).
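For concreteness, here's a minimal sketch of that reward structure using the gymnasium package's MountainCar-v0 (a random policy standing in for a real agent):

```python
import gymnasium as gym

env = gym.make("MountainCar-v0")
obs, info = env.reset(seed=0)

total_reward = 0.0
for _ in range(200):  # MountainCar-v0's default episode cap is 200 steps
    action = env.action_space.sample()  # 0: push left, 1: no push, 2: push right
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward  # reward is -1 on every step
    if terminated:  # car reached the flag (position >= 0.5)
        print("Reached the goal; no further -1 rewards accrue.")
        break
    if truncated:  # hit the step limit without reaching the goal
        break

print(f"Total reward: {total_reward}")
env.close()
```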
Part of the environment's design was to challenge algorithms that rely on random exploration: this (along with the slopes of the hills) makes it very unlikely that a random sequence of actions will get you to the goal state. I don't know the exact odds, but I wouldn't be surprised if they were worse than 1 in 100 million.
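One way to sanity-check that intuition is a quick Monte Carlo sweep with a purely random policy. The episode count below is arbitrary, and in practice runs like this almost always report zero successes within the 200-step limit:

```python
import gymnasium as gym

# Rough Monte Carlo estimate of how often a purely random policy
# reaches the goal before the episode is cut off.
env = gym.make("MountainCar-v0")
episodes = 10_000
successes = 0

for ep in range(episodes):
    obs, info = env.reset(seed=ep)
    while True:
        obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
        if terminated:  # reached the flag
            successes += 1
            break
        if truncated:  # ran out of steps
            break

print(f"{successes}/{episodes} random episodes reached the goal")
env.close()
```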
One way around this is to build in a bias towards less-visited states, all other things being equal, so the agent prefers states at the edge of what it has seen so far. Another is to divide up the state space into tiles and use an evolutionary algorithm to determine the policy, although that won't scale well to problems with larger action/state spaces. In an environment like this it essentially just moves the randomness from individual actions to groups of actions over similar states, with the number of tiles as an extra parameter; in the limit of infinitely many tiles you're back to the original problem.
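Here's a sketch of the first idea, count-based exploration over a discretized state space. The bin count and bonus scale are assumed values, and the random policy is just a stand-in for whatever learner consumes the shaped reward:

```python
import math
from collections import defaultdict

import gymnasium as gym
import numpy as np

# Discretize the 2-D state (position, velocity) into bins and add a
# bonus that shrinks with visit count, so ties in the constant -1
# reward break toward less-visited states.
N_BINS = 20        # assumed granularity; too many bins -> back to pure randomness
BONUS_SCALE = 1.0  # assumed weight on the exploration bonus

env = gym.make("MountainCar-v0")
low, high = env.observation_space.low, env.observation_space.high
visits = defaultdict(int)

def discretize(obs):
    """Map a continuous observation to a tile index per dimension."""
    ratios = (obs - low) / (high - low)
    return tuple(np.clip((ratios * N_BINS).astype(int), 0, N_BINS - 1))

def shaped_reward(reward, obs):
    """Environment reward plus a 1/sqrt(count) novelty bonus."""
    tile = discretize(obs)
    visits[tile] += 1
    return reward + BONUS_SCALE / math.sqrt(visits[tile])

obs, info = env.reset(seed=0)
for _ in range(200):
    action = env.action_space.sample()  # stand-in for the learner's policy
    obs, reward, terminated, truncated, info = env.step(action)
    r = shaped_reward(reward, obs)      # feed r, not reward, to the learner
    if terminated or truncated:
        break
env.close()
```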