Hi! I have implemented the Asynchronous Advantage Actor-Critic (A3C) algorithm in TensorFlow and am using it to train agents for various Atari 2600 environments. I have noticed that it learns to play Pong quite quickly: after 4-5 hours it starts scoring consistently, and a good strategy visibly starts taking shape. However, after training on Space Invaders for the same amount of time (4-5 hours), the agent seems to have converged to a strategy where the player never moves at all and only shoots upward from its starting position.
My question is: is this a sign that training has converged to a sub-optimal strategy that will not improve no matter how much longer I leave it running, or is this normal, and the agent could still start moving and exploring different strategies later on, perhaps after a day or two of training?
In short: should I definitely start debugging my algorithm now, or is there a chance it is correct but just slow? Thanks!
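In case it helps anyone answer: one diagnostic I could log is the Shannon entropy of the policy's action distribution. If the agent has truly collapsed to the "never move, only shoot" behaviour, the entropy should be near zero; if it is still exploring, the entropy should stay well above zero. The function below is a minimal sketch (the function name and example probability vectors are mine, not from my actual code):

```python
import math

def policy_entropy(action_probs):
    """Shannon entropy (in nats) of a softmax policy output.

    A policy that has collapsed to a single action (e.g. always
    'fire' from the start position) has entropy near 0; a uniform
    policy over n actions has entropy log(n).
    """
    # Skip zero probabilities, since lim p->0 of p*log(p) is 0.
    return -sum(p * math.log(p) for p in action_probs if p > 0)

# Collapsed policy: almost all probability mass on one action.
print(policy_entropy([0.97, 0.01, 0.01, 0.01]))   # ~0.17, near zero

# Uniform policy over 4 actions: entropy = log(4) ~ 1.386.
print(policy_entropy([0.25, 0.25, 0.25, 0.25]))
```

Averaging this over a few hundred recent frames and watching the trend seems like a cheap way to distinguish "converged prematurely" (entropy flat near zero) from "still learning slowly" (entropy decreasing gradually), which is exactly what I am trying to tell apart.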