PPO in continuous environments



Could somebody please help me understand how PPO can be applied in continuous environments? Is it the same process as in DDPG?

Thanks!


Hey, following up on the reply from the other thread: I have a somewhat working pendulum PPO implementation. Note this is an older version, where I'm collecting data across a fixed number of episodes.

The new version (not finished yet) will be more in line with what's done, for example, in OpenAI Baselines' ppo2 (fixed step size, mini-batch updates).

This implementation's hyperparameters haven't really been tuned, so there's room for more alchemy.

I would recommend you get the master branch of PyTorch, which has a much richer distributions library, so you can avoid using a Gaussian policy and switch to something better like the Beta distribution, which has finite support (and thus actual bounds on your actions).
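To illustrate, here's a minimal sketch of a Beta policy using `torch.distributions.Beta` — the parameterisation (softplus + 1) and the Pendulum-style action bounds are my own illustrative choices, not from any particular codebase:

```python
import torch
from torch.distributions import Beta
from torch.nn.functional import softplus

# Stand-in for a policy head's raw outputs (two values: alpha and beta).
raw = torch.randn(2)

# softplus + 1 keeps both concentration parameters > 1, so the
# distribution is unimodal over [0, 1].
alpha = softplus(raw[0]) + 1.0
beta = softplus(raw[1]) + 1.0

dist = Beta(alpha, beta)
a = dist.sample()              # sample lies in [0, 1] by construction
log_prob = dist.log_prob(a)

# Rescale [0, 1] to the environment's action bounds,
# e.g. Pendulum's [-2, 2].
low, high = -2.0, 2.0
action = low + (high - low) * a
```

The appeal over a Gaussian is that the sample is bounded by construction, so no clipping (and no probability mass outside the valid action range) is needed.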


I have some examples here. I’ve recently improved the LSTM and distributed PPO implementations and will make some updates to GitHub soon https://github.com/Anjum48/rl-examples


If I understand correctly:

  • You have two heads for the policy: one for the mean and one for the standard deviation
  • At each frame, you sample an action from the normal distribution defined by the mean and std heads
  • Act and collect rewards until the episode finishes
  • Update the critic, then update the policy
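The sampling step above can be sketched like this — the network shapes and names (`GaussianPolicy`, `mean_head`, `std_head`) are illustrative, not from any specific implementation:

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class GaussianPolicy(nn.Module):
    """Two-headed policy: one head for the mean, one for the (log) std."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.mean_head = nn.Linear(hidden, act_dim)
        self.std_head = nn.Linear(hidden, act_dim)   # predicts log std

    def forward(self, obs):
        h = self.body(obs)
        mean = self.mean_head(h)
        std = self.std_head(h).exp()                 # exp() keeps std > 0
        return Normal(mean, std)

policy = GaussianPolicy(obs_dim=3, act_dim=1)        # Pendulum-like sizes
obs = torch.randn(3)
dist = policy(obs)
action = dist.sample()                               # one sample per frame
log_prob = dist.log_prob(action).sum()               # summed over action dims
```

The `log_prob` here is what goes into the PPO probability ratio at update time.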

I have a few questions however:

  • The action value might very well be negative. How do you handle the log_prob in that case?
  • How would you extend this to several actions ?

EDIT: Actually, I think I figured the first question out. The log_prob corresponds to the probability of taking an action a sampled from the normal distribution defined by the mean and the variance. Right?

Still trying to figure out the second part.


The Baselines repo contains an example MuJoCo PPO agent, which would be a good place to start. I have slightly modified that agent to run in the simple ContinuousMountainCar environment; it works quite well and converges in a few minutes to a score of about 90.


For the policy, you can have two heads for the mean and std dev, or just a single head for the mean, leaving the std dev as a trainable parameter. The latter is how it's done in Baselines and in my example (since I've found it the most stable), but the literature isn't clear on which is better.
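A rough sketch of that second variant, where the std dev is a free parameter rather than a network head (the class and layer sizes are made up for illustration):

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class MeanPolicy(nn.Module):
    """Single head for the mean; log std is a learned, state-independent
    parameter that the optimiser updates alongside the network weights."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.mean = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))  # starts at std = 1

    def forward(self, obs):
        return Normal(self.mean(obs), self.log_std.exp())

policy = MeanPolicy(obs_dim=3, act_dim=1)
dist = policy(torch.randn(3))
action = dist.sample()
```

Because `log_std` is an `nn.Parameter`, it shows up in `policy.parameters()` and gets trained by the same optimiser as everything else.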

For your questions:

  1. It doesn't matter that the action is negative when calculating the log probability - you're right in your edit
  2. If you look at https://github.com/Anjum48/rl-examples/blob/master/ppo/ppo.py#L142, the policy outputs a mean vector of the same shape as the action dimensions. So you actually sample from an n-dimensional Gaussian distribution at each time step (n, or a_dim, is 4 for BipedalWalker). When you don't want any exploration (i.e. training is over), you can just use the mean directly from the policy (see the eval_state function for stochastic & deterministic policies)
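The n-dimensional case can be sketched as follows (the mean and std here are placeholders standing in for the policy's outputs; a_dim = 4 as for BipedalWalker):

```python
import torch
from torch.distributions import Normal

# Stand-ins for the policy's outputs over a 4-dimensional action space.
mean = torch.zeros(4)
std = torch.ones(4)

dist = Normal(mean, std)               # 4 independent Gaussians
action = dist.sample()                 # stochastic: used during training
log_prob = dist.log_prob(action).sum(-1)   # joint log prob = sum over dims

deterministic_action = mean            # evaluation: just take the mean
```

Note that `log_prob` for a negative action is perfectly fine: it's the log of a probability *density*, which is positive everywhere on the real line, so the sign of the action never causes a problem.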


If PPO really is usable in continuous action spaces, then I'm wondering what the point of DDPG is. What would you say?

I'm currently trying it on CartPole: I'll select action 0 if the sample from the normal distribution is over 0, and action 1 otherwise. I'll let you know if it works.


CartPole has a discrete action space. You’ll need an environment with a continuous action space to test this, Pendulum-v0 is a good starting point. The action spaces for all the environments are listed here https://github.com/openai/gym/wiki/Table-of-environments.

DDPG is a couple of years older than PPO but is still pretty powerful. In my experience, DDPG is more difficult to get working in higher-dimensional action spaces than PPO. My guess (I might be wrong here) is that since there are two gradient calculations in the policy for DDPG, things can get unstable quite quickly (gradient clipping and batch normalisation might help). Throw in things like LSTMs and distributed workers and things start to get tricky. PPO makes this a lot simpler by keeping tighter control on each update.


Yep, I know that CartPole's action space is discrete. I wanted to force the agent to select action 0 if the output is less than 0, and action 1 otherwise. Or is that a bad idea?


I guess the issue you might face is that if you start forcing the actions in that manner, the gradients become discontinuous, which could cause problems with convergence. Using a softmax is the standard method for discrete action spaces. If you're feeling adventurous, you could use another distribution like a Binomial, but I've not seen this done before.
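The standard softmax approach can be sketched with `torch.distributions.Categorical` — the logits here are just placeholder numbers standing in for a policy head's output:

```python
import torch
from torch.distributions import Categorical

# Stand-in for a policy head outputting one logit per discrete action
# (2 actions, as in CartPole).
logits = torch.tensor([0.2, -1.0])

dist = Categorical(logits=logits)   # softmax is applied internally
action = dist.sample()              # 0 or 1
log_prob = dist.log_prob(action)    # differentiable w.r.t. the logits
```

Since `log_prob` is differentiable with respect to the logits, the PPO update works exactly as in the continuous case, just with this distribution swapped in for the Gaussian.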


Alright. Well, I'll go for Pendulum first then.
Do you normalize (mean and std) the rewards for Pendulum? It seems to me that their magnitude is very big.


In the DPPO paper, they use a running standard deviation to rescale the rewards, which I skipped in the interest of keeping hyperparameters to a minimum (they didn't specify a rolling window). I do, however, standardise the advantages per batch (as per the Baselines implementation).
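The per-batch advantage standardisation is a one-liner; a minimal sketch (the batch of advantages here is random placeholder data):

```python
import torch

# Stand-in for a batch of advantage estimates from the rollout.
advantages = torch.randn(64) * 10.0 + 5.0

# Standardise to zero mean and unit std per batch; the small epsilon
# guards against division by zero when a batch has near-constant values.
advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
```

This keeps the scale of the policy-gradient signal roughly constant across batches, regardless of the raw reward magnitudes of the environment.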


Well, I've tried hard and I still have no results. I applied it to two environments.

  1. Pendulum: a nightmare. I couldn't stand seeing those bad mean rewards.
  2. A home-made reacher env. In this case, the agent always outputs the same value (different for each training run), a kind of mode collapse. I tried a super simple reacher in which the agent gets as observation the vector between its effector and the target and basically has to output the same thing.

I’ll let you know if I get any success at this.