I’m just getting into machine learning, and I’m using Dhruv Parthasarathy’s Pong policy gradient code (itself based on the work of Andrej Karpathy). I figured a good first foray would be to try to adapt this code to other environments in gym. I chose Go9x9, but I’m open to other environments with fewer actions.
Parthasarathy’s code uses the network to output a single probability: that of moving the paddle up. I’d appreciate pointers on adapting this to work with multiple actions rather than a single binary one. If there’s a better baseline example I could adapt instead, I’d be interested to know that as well.
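For concreteness, here is a minimal sketch of the change I think I need: replacing the single sigmoid output with a softmax over N actions, sampling from it, and using the standard `one_hot(action) - probs` gradient of the log-probability with respect to the logits (the multi-action analogue of the `y - aprob` term in the Pong code). The layer sizes and variable names are my own guesses, not from the original code:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)

# Hypothetical sizes: 9x9 board as input, 82 actions (81 points + pass)
n_obs, n_hidden, n_actions = 81, 200, 82
W1 = rng.standard_normal((n_hidden, n_obs)) / np.sqrt(n_obs)
W2 = rng.standard_normal((n_actions, n_hidden)) / np.sqrt(n_hidden)

def policy_forward(x):
    h = np.maximum(0, W1 @ x)  # ReLU hidden layer, as in the Pong net
    logits = W2 @ h
    return softmax(logits), h

x = rng.standard_normal(n_obs)           # stand-in for a preprocessed board
probs, h = policy_forward(x)
action = rng.choice(n_actions, p=probs)  # sample an action from the policy

# Gradient of log pi(action | x) w.r.t. the logits: one_hot(action) - probs.
# This would be scaled by the discounted return during the update, just as
# the single-probability version scales (y - aprob).
dlogits = -probs
dlogits[action] += 1.0
```

If this is roughly the right shape of the change, my remaining question is mostly about the bookkeeping (accumulating `dlogits` per step and backpropagating through the hidden layer).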