A3C Not converging..... at all


#1

Hi guys.

I have made my first attempt at learning the A3C approach to deep reinforcement learning. As I am currently running on Windows 10 I cannot use OpenAI Gym itself, so I decided to implement my own simple game. The board is a 21x21 grid on which you have positive rewards (green spots), negative rewards (red spots) and the player (white spot). The player can move up, down, left or right, and that is it. If the move lands on a field with a green spot you get +1 point; if you end up on a red one, -1. If the new position has no object on it, the reward is 0. When a reward is consumed, a new one is spawned at a random location on the board. I thought this was going to be as simple as it can get… but the net is not learning… at all. Even after 24h of training it seems to move randomly, with no real strategy whatsoever.
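To make the rules concrete, here is a minimal sketch of the game logic as described above (names and spot counts are illustrative, not the actual repo code):

```python
import numpy as np

BOARD = 21
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

class GridGame:
    def __init__(self, n_green=5, n_red=5):
        self.n_green, self.n_red = n_green, n_red
        self.reset()

    def _free_cell(self):
        # Pick a random empty cell for a newly spawned spot.
        while True:
            pos = (np.random.randint(BOARD), np.random.randint(BOARD))
            if pos != self.player and pos not in self.green and pos not in self.red:
                return pos

    def reset(self):
        self.player = (BOARD // 2, BOARD // 2)
        self.green, self.red = set(), set()
        for _ in range(self.n_green):
            self.green.add(self._free_cell())
        for _ in range(self.n_red):
            self.red.add(self._free_cell())
        return self._state()

    def _state(self):
        # Board rendered as a 21x21 RGB image: green/red spots, white player.
        s = np.zeros((BOARD, BOARD, 3), dtype=np.float32)
        for g in self.green:
            s[g] = (0, 1, 0)
        for r in self.red:
            s[r] = (1, 0, 0)
        s[self.player] = (1, 1, 1)
        return s

    def step(self, action):
        dy, dx = ACTIONS[action]
        y = min(max(self.player[0] + dy, 0), BOARD - 1)
        x = min(max(self.player[1] + dx, 0), BOARD - 1)
        self.player = (y, x)
        reward = 0.0
        if self.player in self.green:    # +1 and respawn the consumed spot
            self.green.remove(self.player)
            self.green.add(self._free_cell())
            reward = 1.0
        elif self.player in self.red:    # -1 and respawn
            self.red.remove(self.player)
            self.red.add(self._free_cell())
            reward = -1.0
        return self._state(), reward
```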

At the beginning I thought it was my implementation of A3C that was wrong, so I started from the one implemented by jaromiru here (https://github.com/jaara/ai_examples/blob/master/open_gym/a3c_cartpole.py). That one, however, now seems to have the same problem.

I uploaded my latest (not so clean) code to GitHub here (https://github.com/Palkos83/DeepSee).

The network I use has 2 conv layers + 1 FC layer, followed by policy and value prediction layers (you can see the graph in the GitHub repo). What is really strange is that the value function output seems to always be 0.
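Concretely, the architecture is along these lines (a simplified Keras sketch; the filter and unit counts here are illustrative, not the exact values from the repo):

```python
from keras.models import Model
from keras.layers import Input, Conv2D, Flatten, Dense

NUM_ACTIONS = 4  # up, down, left, right

def build_model():
    # Input: the 21x21 board with 3 colour channels
    state = Input(shape=(21, 21, 3))

    # Two conv layers + one fully connected layer
    x = Conv2D(16, (3, 3), activation='relu')(state)
    x = Conv2D(32, (3, 3), activation='relu')(x)
    x = Flatten()(x)
    x = Dense(256, activation='relu')(x)

    # Two heads: action probabilities (policy) and state value estimate
    policy = Dense(NUM_ACTIONS, activation='softmax')(x)
    value = Dense(1, activation='linear')(x)

    return Model(inputs=state, outputs=[policy, value])
```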

So where do I take it from here? There seems to be a lot of material about nets that are learning, but it is hard to find any hints on what to do when they are not.
Is the architecture wrong in some way (not enough layers, filters, etc.)?
Is the learning process itself wrong? Not enough exploration? A learning rate that is too high?
Finally, I was trying to force TensorBoard to display the weights of the conv layers as images, with no luck so far. Any hint on how to do this?
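For reference, what I was attempting is roughly the following (a sketch; the tensor name `conv1/kernel:0` is a placeholder for whatever the kernel variable is actually called in your graph):

```python
import tensorflow as tf

# 'conv1/kernel:0' is a placeholder name; substitute the real variable name.
kernel = tf.get_default_graph().get_tensor_by_name('conv1/kernel:0')

# The kernel has shape [height, width, in_channels, out_channels], while
# tf.summary.image expects [batch, height, width, channels]. Moving
# out_channels into the batch dimension gives one image per filter
# (this only renders directly when in_channels is 1, 3 or 4).
filters = tf.transpose(kernel, [3, 0, 1, 2])
tf.summary.image('conv1_filters', filters, max_outputs=16)
```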

Many thanks for any help. Much appreciated.


#2

Just a small clarification: the code checked in to GitHub actually does not use green or red for the rewards. I was trying to use grayscale in this version, sadly with the same result.

Any hints?