I think I understand most of the OpenAI Baselines A2C code, but I have one question.
How is the single-step model updated after training? Two policy models are created: a single-step model and an n-step model used for training. As I understand it, the single-step model is used to get the actual rewards from the env, and the train model is used to predict these rewards. When we train, all the losses are computed from the train model, so when the parameters are updated I do not see how the single-step model is ever updated (lines 35-50). Please explain.
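To make the setup concrete, here is a minimal NumPy sketch of the pattern I am asking about. It is not the Baselines code: `LinearPolicy`, `shared`, and the toy squared-error loss are hypothetical names I made up for illustration. It assumes the two models are built over one shared set of parameters, which is only my guess about how the real code might be wired. Under that assumption, an update computed through the train model would also be visible through the single-step model.

```python
import numpy as np


class LinearPolicy:
    """Hypothetical stand-in for a policy network (not the Baselines class)."""

    def __init__(self, params, batch_size):
        # Keep a reference to the shared parameter dict, not a copy.
        self.params = params
        self.batch_size = batch_size

    def value(self, obs):
        # Linear value estimate: one scalar per observation row.
        return obs @ self.params["w"]


# One set of parameters, two model objects with different batch sizes.
shared = {"w": np.zeros(3)}
step_model = LinearPolicy(shared, batch_size=1)   # used to act in the env
train_model = LinearPolicy(shared, batch_size=5)  # used to compute the loss

obs = np.ones((1, 3))
before = step_model.value(obs).copy()

# One gradient step on a toy squared-error loss, computed only via train_model.
batch_obs = np.ones((5, 3))
targets = np.ones(5)
pred = train_model.value(batch_obs)
grad = 2 * batch_obs.T @ (pred - targets) / len(targets)
train_model.params["w"] -= 0.1 * grad  # updates the shared array

after = step_model.value(obs)
print(before, after)  # the single-step model's output changes as well
```

If this assumption is wrong and the two models hold separate copies of the weights, then something would have to copy the parameters across explicitly, and that is the step I cannot find.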