Can someone explain the policy gradient output layers for continuous & discrete action spaces?


I’m building my first policy gradient learner. There are lots of examples of parameterizing a policy for a discrete action space (usually using a softmax on the output layer). It’s not so clear to me what to do for a continuous action space (for example, if we are parameterizing the mean of a normal distribution).

Can anyone help me gain some intuition on what is going on for both cases?


Yes. The most common approach is to have a NN with the observations as inputs and the mean of each action dimension as outputs. You also need to learn the variances; this is typically done with just a learnable vector (not a full NN). You will see the variances decrease as the policy improves during training. I have personally tried learning the variances with a NN (both separate from and combined with the NN that outputs the means). The results weren’t much better or worse. I’ve also tried learning a full covariance matrix for the Gaussian policy; sometimes this worked better, in other cases there was not much improvement.
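To make that concrete, here is a minimal sketch of a diagonal Gaussian policy head in plain Python. The values in `means` stand in for the NN outputs and `log_stds` stands in for the learnable vector; both are made-up numbers. Storing the log of the standard deviation rather than the variance is my own assumption here (it keeps the spread positive), not something required by the approach.

```python
import math
import random

def sample_action(means, log_stds, rng=random):
    """Draw one action from a diagonal Gaussian policy."""
    return [rng.gauss(mu, math.exp(ls)) for mu, ls in zip(means, log_stds)]

def gaussian_log_prob(action, means, log_stds):
    """Log density of `action` under a diagonal Gaussian, summed over dimensions."""
    total = 0.0
    for a, mu, ls in zip(action, means, log_stds):
        std = math.exp(ls)
        total += -0.5 * ((a - mu) / std) ** 2 - ls - 0.5 * math.log(2.0 * math.pi)
    return total

# Hypothetical values: `means` would be the network output for one observation,
# `log_stds` is the standalone learnable vector (one entry per action dimension).
means = [0.3, -1.2]
log_stds = [0.0, -0.5]

action = sample_action(means, log_stds)
log_prob = gaussian_log_prob(action, means, log_stds)
```

During training both the network weights and `log_stds` would receive gradients through `log_prob`.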


Thanks for your reply. I’m struggling to fully understand the training process.

I understand that for discrete action spaces, the model is trained by outputting the action probabilities and indexing on the action actually taken; the loss function then becomes:
loss = - log(action_probability) * discounted rewards
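
For reference, the discrete case as I understand it could be sketched like this in plain Python. The function names and the example logits are hypothetical; in a real model the logits would come from the last layer of the network.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def discrete_pg_loss(logits, action, discounted_return):
    """-log(probability of the action taken) * discounted return, for one step."""
    probs = softmax(logits)
    return -math.log(probs[action]) * discounted_return

# Hypothetical single step: three possible actions, action 1 was taken.
step_loss = discrete_pg_loss([0.2, 1.5, -0.3], action=1, discounted_return=4.0)
```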

I’m trying to build a simple Monte Carlo policy gradient model.

When training the model, the loss function will be equal to:
loss = - log(probability of the action taken) * discounted rewards

Is that correct?

I struggle even more to think about how to formulate this as a tf.gradient operation…


You’ve got it right:

loss = -log(pi(a | s; theta)) * sum of discounted rewards.

pi will use the means and variances from your NN and your selected action to calculate the probability of a in the above equation (strictly speaking, for a continuous action this is a probability density).

Sum all the losses from your training batch into a single loss: total_loss

And you generally want to replace sum of discounted rewards with an advantage function:

A(s, a) = sum of discounted rewards - V(s)

There are more sophisticated advantage functions, but that one will work well enough to get you going.
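
A minimal sketch of that advantage calculation for one episode, in plain Python. The function names and the default `gamma` are my own placeholders, and `values` is assumed to hold V(s) estimates from whatever baseline you have.

```python
def discounted_returns(rewards, gamma=0.99):
    """Sum of discounted future rewards for every time step of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def advantages(rewards, values, gamma=0.99):
    """A(s, a) = sum of discounted rewards - V(s), per time step."""
    return [g - v for g, v in zip(discounted_returns(rewards, gamma), values)]

# Hypothetical episode of three steps with a baseline estimate for each state.
adv = advantages([1.0, 1.0, 1.0], values=[1.0, 1.0, 1.0], gamma=0.5)
```

You would then use `adv[t]` in place of the sum of discounted rewards in the loss.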

As far as taking the gradient, you can just start with this:

optimizer = tf.train.GradientDescentOptimizer(learning_rate = __)
train_op = optimizer.minimize(total_loss)
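
If it helps to see what a step like that does under the hood, here is a framework-free sketch for the simplest case I can think of: a 1-D Gaussian policy with a fixed standard deviation, where the only parameter is the mean. For that case the gradient of -log pi(a) * return with respect to the mean is -(a - mean) / std^2 * return. All names and numbers below are hypothetical, and this is plain gradient descent, not what any particular library does internally.

```python
def loss_grad_wrt_mean(action, mean, std, ret):
    """Derivative of -log N(action | mean, std) * ret with respect to mean."""
    return -(action - mean) / (std ** 2) * ret

def sgd_step(mean, batch, std=1.0, learning_rate=0.1):
    """One gradient-descent step on the loss summed over (action, return) pairs."""
    grad = sum(loss_grad_wrt_mean(a, mean, std, g) for a, g in batch)
    return mean - learning_rate * grad

# Hypothetical batch of (action taken, sum of discounted rewards) pairs.
mean = 0.0
batch = [(1.0, 2.0), (-1.0, -0.5)]
mean = sgd_step(mean, batch)
```

Actions with a positive return pull the mean toward them, and actions with a negative return push it away.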

Here is a GitHub repo that might help a bit, even though it is a slightly more advanced version of policy gradient:


So my understanding is that for training the continuous case:

  • use the probability distribution as generated by the network now
  • calculate the log probability of selecting action a (the action that was actually taken in the episode - not an action sampled from the probability dist of the network now)
  • use the discounted rewards, minimize loss etc etc
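
The steps in the list above could be sketched roughly like this for 1-D actions, in plain Python. `policy_mean` is a hypothetical stand-in for the network with its current parameters, and `std` is assumed fixed to keep the example short.

```python
import math

def log_prob(action, mean, std):
    """Log density of a 1-D Gaussian at `action`."""
    return (-0.5 * ((action - mean) / std) ** 2
            - math.log(std) - 0.5 * math.log(2.0 * math.pi))

def episode_loss(states, actions, returns, policy_mean, std):
    """Monte Carlo policy gradient loss for one stored episode."""
    total = 0.0
    for s, a, g in zip(states, actions, returns):
        mean = policy_mean(s)                  # distribution from the current network
        total += -log_prob(a, mean, std) * g   # a is the stored action, not a fresh sample
    return total

# Hypothetical linear "network" and a two-step episode.
policy_mean = lambda s: 0.5 * s
loss = episode_loss([0.0, 2.0], [0.1, 0.8], [1.5, 1.0], policy_mean, std=1.0)
```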

So, once we have trained the policy, the probability distribution generated by the network could be very different from the distribution that generated the action a actually taken in the previous episode?

Using an advantage function is my next project - just want to get my understanding of vanilla MC policy grads solid first!

Regarding the gradient - is there any advantage in using tf.gradients, and applying them? Is it just to get visibility of the actual gradients for use in something like tensorboard?

Thanks for the repo - will check it out. And thanks so much for your posts!