You’ve got it right:

loss = -log(pi(a | s; theta)) * sum of discounted rewards.

pi uses the means and variances output by your NN, together with the action you actually selected, to compute the probability density of a in the equation above.

Sum all the losses from your training batch into a single loss, total_loss.
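As a concrete sketch (not your code, and the numbers are placeholders), here is the per-transition loss for a diagonal Gaussian policy, written in numpy:

```python
import numpy as np

def gaussian_log_prob(a, mu, var):
    # log N(a | mu, var), summed over action dimensions
    return np.sum(-0.5 * ((a - mu) ** 2 / var + np.log(2 * np.pi * var)))

# one transition: loss = -log pi(a | s) * (sum of discounted rewards)
mu = np.array([0.1, -0.2])    # means from the NN (placeholder values)
var = np.array([0.5, 0.5])    # variances from the NN (placeholder values)
a = np.array([0.3, 0.0])      # action you actually selected
discounted_return = 4.2       # sum of discounted rewards for this transition
loss = -gaussian_log_prob(a, mu, var) * discounted_return
```

total_loss is just this quantity summed over every transition in the batch.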

And you generally want to replace the sum of discounted rewards with an advantage function:

A(s, a) = sum of discounted rewards - V(s)

There are more sophisticated advantage functions, but that one will work well enough to get you going.
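A quick sketch of that advantage, again in numpy. The V(s) values here are placeholders; in practice they would come from a learned value network:

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    # sum of discounted rewards from each timestep to the end of the episode
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

rewards = [1.0, 0.0, 1.0]
returns = discounted_returns(rewards, gamma=0.9)  # [1.81, 0.9, 1.0]
v_estimates = np.array([1.5, 0.8, 0.9])           # placeholder V(s) estimates
advantages = returns - v_estimates                # A(s, a) = return - V(s)
```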

As far as taking the gradient goes, you can just start with this:

optimizer = tf.train.GradientDescentOptimizer(learning_rate = __)

train_op = optimizer.minimize(total_loss)
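For intuition, here is what that minimize step works out to by hand for a 1-D Gaussian policy with a learnable mean and fixed variance (a toy sketch, not TensorFlow code). The gradient of -log pi(a | mu, var) with respect to mu is -(a - mu)/var, so the descent step nudges mu toward actions that earned a positive return:

```python
mu, var, lr = 0.0, 1.0, 0.1
a, ret = 0.5, 2.0                 # sampled action and its return

grad_mu = -(a - mu) / var * ret   # d(loss)/d(mu) for loss = -log pi * return
mu = mu - lr * grad_mu            # one gradient-descent step

# mu has moved from 0.0 toward a = 0.5
```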

Here is a GitHub repo that might help a bit, even though it implements a slightly more advanced version of policy gradients:

https://github.com/pat-coady/trpo