Demo of the solution here, using different numbers of prediction steps.

Full code here

Our results

After observing the environment for 100 episodes, our agent is able to balance the pole indefinitely in all subsequent games. We get the best results by anticipating 4-8 steps in advance. The more steps the agent can anticipate (the larger the number of predicted steps), the more stably the pole is balanced.

Our approach

Our solution to the OpenAI CartPole problem combines several aspects of machine learning to solve it in the most modular, yet efficient, way possible.

Our solution learns by observing the first 100 games, then trains a sequential model made of two stateful LSTM layers, one ReLU layer, and a regular dense layer for output. Once the model is trained, we do not store any discrete information about the environment. Instead, we first teach the model to anticipate the next state of the environment given any plausible action. In essence, the LSTMs learn which next state and reward are most probable given a series of events (states + actions). This avoids the data explosion of traditional discrete MDP solutions and is also computationally efficient.
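As a rough illustration of the observation phase, here is a minimal sketch of how recorded games could be turned into training pairs for a sequence model. It assumes each episode is stored as a list of (state, action, reward) steps. The name make_training_pairs and the window length are hypothetical and are not taken from our code.

```python
from typing import List, Sequence, Tuple

# Assumed layout of one recorded step: (state, action taken, reward received)
Step = Tuple[Sequence[float], int, float]


def make_training_pairs(episode: List[Step], window: int = 4):
    """Build (input sequence, target) pairs from one recorded episode.

    Each input is `window` consecutive state+action vectors. The target is
    the state that follows the window, plus the reward of the window's
    last step.
    """
    inputs, targets = [], []
    for end in range(window, len(episode)):
        # Concatenate each state with its action to form one event vector.
        seq = [list(s) + [float(a)] for s, a, _ in episode[end - window:end]]
        next_state = list(episode[end][0])
        reward = episode[end - 1][2]
        inputs.append(seq)
        targets.append(next_state + [reward])
    return inputs, targets
```

With a 4-value CartPole state and one action value, each event vector has 5 entries, so an episode of 6 steps and a window of 4 yields 2 training pairs.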

Once our model can anticipate future states correctly, even for events it has never encountered, we use it in an MDP solved with a Bellman approach (utility/reward at state 1 = reward + sum of discounted expected rewards of future steps given all possible actions). The twist here is that we recursively calculate the expected reward using the LSTM model, which is now adept at anticipating future states and rewards given a series of state-action pairs.
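The recursion can be sketched roughly as follows. This is not our actual code: predict is a hypothetical stand-in for the trained LSTM model, taking the history of (state, action) pairs and returning a predicted next state and reward. The sketch also assumes the best action is taken at every future step (a max over actions) and uses a discount factor of 0.9, both of which are illustrative choices.

```python
from typing import Callable, List, Tuple

State = Tuple[float, ...]
History = List[Tuple[State, int]]
# Hypothetical signature: history of (state, action) pairs -> (next state, reward)
Predictor = Callable[[History], Tuple[State, float]]


def lookahead_value(predict: Predictor, history: History, state: State,
                    depth: int, actions=(0, 1), gamma: float = 0.9) -> float:
    """Best discounted return predicted from `state` within `depth` steps."""
    if depth == 0:
        return 0.0
    best = float("-inf")
    for action in actions:
        step = history + [(state, action)]
        next_state, reward = predict(step)
        # Recurse on the predicted state instead of a stored transition table.
        future = lookahead_value(predict, step, next_state,
                                 depth - 1, actions, gamma)
        best = max(best, reward + gamma * future)
    return best


def choose_action(predict: Predictor, history: History, state: State,
                  depth: int = 4, actions=(0, 1), gamma: float = 0.9) -> int:
    """Pick the action with the highest predicted return over `depth` steps."""
    def score(action: int) -> float:
        step = history + [(state, action)]
        next_state, reward = predict(step)
        return reward + gamma * lookahead_value(predict, step, next_state,
                                                depth - 1, actions, gamma)
    return max(actions, key=score)
```

The number of model predictions grows as the number of actions raised to the power of the depth, which is one reason to keep the number of anticipated steps small.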

Curious to find out what strategies you have come up with.

Also, how do we officially submit results?