How to make policy gradient work with continuous rewards?


I have looked at the "Pong from Pixels" post by Karpathy to understand more about policy gradients.

It seems to me that it works when the rewards can be split into positive and negative ones. I am wondering how I can make it work with a problem like Acrobot-v1, where the rewards are all negative?


I am not sure if it works, but you could try normalizing the rewards to [-1, 1] if you know what the maximum expected reward for each episode is. The sign of the raw reward does not actually matter to policy gradient; what matters is whether a return is above or below some baseline. Karpathy's Pong code does exactly this: it standardizes the discounted returns (subtracts the mean and divides by the standard deviation), which makes roughly half of them positive even when every raw reward is negative.
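A minimal sketch of that standardization step, assuming an Acrobot-style episode where every timestep yields a reward of -1 (function names here are just for illustration, not from Karpathy's code):

```python
import numpy as np

def discount_rewards(rewards, gamma=0.99):
    """Compute the discounted return at each timestep of one episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def standardize(returns, eps=1e-8):
    """Subtract the mean and divide by the std, as in Karpathy's Pong code.
    After this, returns above the episode average become positive advantages
    and returns below it become negative, regardless of the raw reward sign."""
    return (returns - np.mean(returns)) / (np.std(returns) + eps)

# Acrobot-style episode: -1 reward every step until termination
rewards = [-1.0] * 5
adv = standardize(discount_rewards(rewards))
```

Here the earliest steps (which accumulate the most -1s) end up with negative advantages and the later steps with positive ones, so the gradient still pushes the policy toward ending the episode sooner, even though no raw reward was ever positive.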