# What strategies have you used to solve Cartpole? Ours uses MDP, Bellman, LSTM

#1

Demo of the solution here, using different numbers of prediction steps.

Full code here

Our results

After observing the environment for 100 episodes, our agent is able to balance the pole indefinitely in all subsequent games. We get the best results by anticipating 4-8 steps in advance; the more steps the agent can anticipate, the more stably the pole is balanced.

Our approach

Our solution to the OpenAI Cartpole environment combines several aspects of machine learning to solve the problem in a modular yet efficient way.

Our solution learns by observing the first 100 games, then trains a sequential model made of two stateful LSTMs, a ReLU-activated dense layer, and a regular dense layer for output. Once the model is trained, we do not store any discrete information about the environment. Instead, we first teach the model to anticipate the next state of the environment given any plausible action. In essence, the LSTMs learn that, given a series of events (states + actions), a particular next state and reward are most probable. This is useful because it avoids the data explosion of traditional discrete MDP solutions, and it is also computationally efficient.
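A minimal sketch of the architecture described above, assuming Keras. The layer widths, window length, and batch size are assumptions for illustration, not the original hyperparameters; the input is a window of (state, action) pairs and the output is the predicted next state plus reward.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

STATE_DIM, ACTION_DIM = 4, 1   # CartPole observation + scalar action
SEQ_LEN, BATCH = 8, 1          # lookback window and batch size (assumed)

# Two stateful LSTMs, a ReLU dense layer, and a linear dense output,
# as described in the post. Output = predicted next state + reward.
model = keras.Sequential([
    keras.Input(batch_shape=(BATCH, SEQ_LEN, STATE_DIM + ACTION_DIM)),
    layers.LSTM(64, stateful=True, return_sequences=True),
    layers.LSTM(64, stateful=True),
    layers.Dense(32, activation="relu"),
    layers.Dense(STATE_DIM + 1),  # next state (4 values) plus reward (1 value)
])
model.compile(optimizer="adam", loss="mse")

# One (state + action) window in, one (next state, reward) prediction out.
window = np.zeros((BATCH, SEQ_LEN, STATE_DIM + ACTION_DIM), dtype=np.float32)
prediction = model.predict(window, verbose=0)
```

Statefulness lets the LSTMs carry hidden state across consecutive windows of the same episode, which is why the batch size must be fixed up front.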

Once our model can anticipate future states correctly, even for events it has never encountered, we use it in an MDP solved with a Bellman approach (the utility of a state = its immediate reward + the sum of discounted expected rewards of future steps over all possible actions). The twist is that we calculate the expected rewards recursively using the LSTM model, which is now adept at anticipating future states and rewards given a series of state-action pairs.
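The recursive lookahead above can be sketched as follows. This is a hypothetical illustration, not the original code: `predict` stands in for the trained LSTM dynamics model (given a history of state-action pairs it returns the predicted next state and reward), and this variant takes the max over actions at each level, the Bellman optimality form.

```python
def expected_utility(predict, history, state, actions, depth, gamma=0.9):
    """Utility of `state`: best action's reward plus the discounted utility
    of the predicted next state, recursing `depth` steps into the future."""
    if depth == 0:
        return 0.0
    best = float("-inf")
    for action in actions:
        # Ask the learned model what happens if we take `action` here.
        extended = history + [(state, action)]
        next_state, reward = predict(extended)
        value = reward + gamma * expected_utility(
            predict, extended, next_state, actions, depth - 1, gamma)
        best = max(best, value)
    return best
```

With `depth` set to 4-8 this corresponds to the 4-8 step lookahead reported in the results above; each extra step multiplies the number of model queries by the number of actions.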

Curious to find out what strategies you have come up with.

Also, how do we officially submit results?

#2

Hello FitMachineLearning,

Very nice write-up!

My solution solves it after about 13 episodes.
I used a very simple network optimized with the LMS algorithm.
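For reference, the LMS (least-mean-squares, Widrow-Hoff) rule mentioned here updates a linear unit's weights along the prediction error. This is a generic sketch of the algorithm, not Peter's actual network; the dimensions and learning rate are assumptions.

```python
import numpy as np

def lms_step(w, x, target, lr=0.1):
    """One LMS update: w <- w + lr * (target - w.x) * x."""
    error = target - np.dot(w, x)
    return w + lr * error * x

# Toy usage: fitting y = 2 * x on one repeated sample converges toward w = [2.0].
w = np.zeros(1)
for _ in range(200):
    w = lms_step(w, np.array([1.0]), 2.0)
```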

Here’s the code:

I don’t think there is a way to submit results at this time. I heard that there is an effort at Montreal AI to fix that.

Best Regards,
– Peter

#3

13 episodes! Nice work, SparseInference. That is very interesting.

I especially like your approach of using the mean of the past outputs as targets.

Have you tried your agent/approach on other environments?

#4

Yes, I also applied it to cartpole_v1, which is almost exactly the same. The code for that is also in my gists.

I started to investigate whether I could use the same approach on mountain_car but I haven’t pursued it further since then.

I will probably continue later this year.

Best Regards,
– Peter