I am fairly new to reinforcement learning and OpenAI, so I am not sure if this is the best place to post, but I am having some problems implementing a basic actor-critic algorithm to solve the Pendulum-v0 environment in OpenAI Gym. At this point, my algorithm is pretty much REINFORCE with a baseline. I am using parallel environments because once I get this working, I plan to use it on a more complicated task where each environment step runs much more slowly.
My algorithm seems to be maximizing entropy and pushing both the policy loss and the value loss toward 0, but for some reason the reward is not increasing. I have seen other examples where people consistently reach rewards of around -400, but mine does not seem to rise above -1000. Comparing my code to theirs, I cannot find any meaningful difference, yet I am not getting results. Any help would be greatly appreciated. My code and a screenshot of TensorBoard are in this gist: https://gist.github.com/dapatil211/6957ee71a4bc0d916419df9155fd9408
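For context, here is a simplified sketch (in plain NumPy, not my actual code; the function and argument names are placeholders) of the loss I am trying to implement, in case I am misunderstanding how REINFORCE with a baseline should be set up:

```python
import numpy as np

def reinforce_with_baseline_loss(log_probs, values, returns, entropies,
                                 value_coef=0.5, entropy_coef=0.01):
    """Combined loss for REINFORCE with a learned value baseline.

    All inputs are per-timestep arrays. In an autodiff framework the
    advantage must be treated as a constant w.r.t. the policy parameters
    (detach / stop_gradient), which this NumPy sketch implies but cannot
    enforce.
    """
    advantages = returns - values
    # Policy gradient: push up log-probability of actions with positive advantage.
    policy_loss = -np.mean(log_probs * advantages)
    # Baseline: regress the value estimates toward the empirical returns.
    value_loss = np.mean(advantages ** 2)
    # Entropy bonus: subtracted from the loss to encourage exploration.
    entropy_bonus = np.mean(entropies)
    return policy_loss + value_coef * value_loss - entropy_coef * entropy_bonus
```

One thing I am unsure about is whether my advantage is properly stop-gradient'ed before it multiplies the log-probabilities, since failing to do that can drive both losses to 0 without improving the policy.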