Deep Q learning Pong


Hi, I am trying to do deep q learning from scratch on a dame of pong. I read the deep mind paper on playing atari games and came up with a simple (though not elegant or efficient) way to program it based on that paper. I wanted to know if my general idea of how to program it is correct and if it would work!

(python psuedo code)

using pygame, tensorflow, numpy etc

Define computation graph (variables, placeholders, connections etc)
- graph will supposedly take in a sequence of 4 images which are the 4 most recent frames
- and output action values for up or down for that sequence

  • initialize variables of computation graph

** lets define the function of the computation graph as F(sequence) =

frame = get initial frame of game

  • set current sequence to [frame,frame,frame,frame]

  • sequence_counter = 0

  • batch_counter = 0
    begin game loop
    if(sequence counter =0 ){

    Action = MaxIndex(F(current sequence)) #input sequence into computation graph, get action values

    Paddle moves up or down

    • observe reward,

      Next_sequence[sequence_counter] = get_frame()
      Sequence_Reward = reward


      if sequence_counter = 4

      sequence counter = 0
      add (Current Sequence, Actiontaken , reward, Next_Sequence) to training data
      reward = 0
      if BatchCounter == 1000000
      perform stochastic gradient descent from batch of data from training data collected from observed sequences
      BatchCounter = 0

end game loop


If I understand this correctly (Please correct me if I am wrong), you get a sequence of 4 frames and

  1. see if you have it in memory
    If you do you replay the best action corresponding to this
    if not you store it
  2. then you train your model on your collected data collections of 4-sequences of (State, action, reward and next sate)

I am not really sure where the magic (Decision making) happens. Can you help me there.

If I also understand why you are using 4 frame sequences as input (i.e. determine trajectory and other), then why not use an LSTM, very suited for this type of problem.


An important detail used for most value-based RL algorithms is that you need F() to be a prediction not of expected reward from a 4-frame sequence, but of expected return, which is defined as the discounted sum of all future rewards.

As you do not have this value in the table, the solution to that is to “boostrap” it from your predictor. If S, A are your current state and action and S’, a’ are your next state and possible action choice, then in Q-learning your target for training each F() - known as the TD target would be

F(S, A) target = R + discount * max_a’[F(S’,a’)]

You have to re-calculate this every time you train, as the agent gets better at predicting the true return, the training targets for F() in the mini-batch will change. Discount is often represented by Greek letter lowercase gamma, and should be between 0 and 1. Typical values for continuous problems with small time steps are close to 1, e.g. 0.99 or 0.999

Unfortunately this is not a stable algorithm. I suggest you read the DQN paper to see what they did about the instability - one thing is experience replay, not too different from your idea of saving up very many iterations and then training the agent at the end (your algorithm may even work in this regard, but 10,000,000 iterations, played randomly before training would take a very long time, and the data quality from completely random play might not be enough to train an expert player in one step - ideally you want to sample data from slightly better and better players as time progresses, in order to make more continuous improvements to the agent).



keep it simple, right?

Easiest solution is to just predict the value of next state for action {up,down, none}
As Dr. Sutton says, predict the prediction…all you need.

So wtf does that mean.

  1. Initialize some network randomly.

  2. Play for x episodes episode {state1,2…n}. At end of episode you get a reward, which you need to discount meaning terminal state has +1 or -1, and you scale it down through states of your episodes.

  3. After x episodes done, you train your network to predict the discounted value of each state you experienced.

  4. …and you go to 1 again. Except now your weights changed and you try again to predict the value of state if you move up, down or not and take the action with the highest predicted value. Done

As you continue iterating this, you will eventually converge. Don’t waste time implementing replay buffers, lstms, Q this, Q that etc. To solve majority of RL problems all you really need is some form of temporal difference exploiting the chain rule (ie your policy depends on value prediction, and you are continuously updating your value func based on the result of experience of your policy function).

Don’t spend too much time trying to understand what is Q-learning, dualing Q learning, or any other kind of witchcraft, be it ACER, A3C, A2C, etc… you will find yourself implementing these things on your own as your problem dictates it. The reason these things all have different names is because people that toy around with this code work for at ‘academic’ institutions which implies they have to publish original research and in the field of econ/comp sci, ‘original’ just means changing one term and giving the method a radical new name. Keep it simple, TD all the way. Just need to understand basic neural networks and reward discounting, which Dr. Minsky figured out 4 decades ago if not more.