You’ve got it right:
loss = -log(pi(a | s; theta)) * sum of discounted rewards.
Here pi takes the means and variances output by your NN, together with the action you actually selected, and returns the probability (density) of that action a in the equation above.
Sum the per-step losses from your training batch into a single scalar loss, total_loss.
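As a minimal sketch of that loss in TensorFlow 1.x (the placeholder names states, actions, and returns, and the dimensions state_dim/action_dim, are mine; tf.distributions.Normal needs TF 1.5+):

import tensorflow as tf

state_dim, action_dim = 4, 2  # example dimensions, adjust to your environment

# One batch of transitions (these placeholder names are mine).
states = tf.placeholder(tf.float32, [None, state_dim])
actions = tf.placeholder(tf.float32, [None, action_dim])
returns = tf.placeholder(tf.float32, [None])  # sum of discounted rewards per step

# The network outputs a mean per action dimension; a state-independent
# log std-dev is a common simple choice.
mean = tf.layers.dense(states, action_dim)
log_std = tf.get_variable("log_std", [action_dim], initializer=tf.zeros_initializer())

# pi(a | s; theta) as a diagonal Gaussian over the action vector.
dist = tf.distributions.Normal(loc=mean, scale=tf.exp(log_std))
log_prob = tf.reduce_sum(dist.log_prob(actions), axis=1)  # log pi(a | s; theta)

# -log pi(a | s; theta) * return, summed over the batch into one scalar.
total_loss = tf.reduce_sum(-log_prob * returns)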
You generally want to replace the sum of discounted rewards with an advantage function:
A(s, a) = sum of discounted rewards - V(s)
There are more sophisticated advantage functions, but that one will work well enough to get you going.
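As a concrete sketch of that computation in NumPy (the reward and value numbers are made up; in practice V(s) comes from whatever baseline or value network you train):

import numpy as np

def discounted_returns(rewards, gamma=0.99):
    # Sum of discounted rewards from each time step to the end of the episode.
    out = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

rewards = [1.0, 0.0, 2.0]                    # one short episode, made-up numbers
value_estimates = np.array([2.8, 1.9, 1.8])  # V(s) for each visited state
advantages = discounted_returns(rewards) - value_estimates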
As far as taking the gradient goes, you can just start with this:

optimizer = tf.train.GradientDescentOptimizer(learning_rate=__)  # fill in a learning rate
train_op = optimizer.minimize(total_loss)  # builds the gradient and update ops for you
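Putting it together, one update step in a session might look like the following (this assumes you filled in a learning rate above; the random arrays just stand in for a collected batch, and once you have a baseline you would feed advantages in place of the raw returns):

import numpy as np

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Random data stands in for a batch of collected trajectories.
    feed = {
        states: np.random.randn(32, state_dim),
        actions: np.random.randn(32, action_dim),
        returns: np.random.randn(32),
    }
    _, loss_val = sess.run([train_op, total_loss], feed_dict=feed)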
Here is a GitHub repo that might help a bit, even though it implements a slightly more advanced version of policy gradient: