PPO in continuous environments



Could somebody please help me understand how PPO can be applied in continuous environments? Is it the same process as in DDPG?

Thanks !


Hey, following up on the reply from the other thread, I have a somewhat working pendulum PPO implementation. Note this is an older version, where I'm collecting data across a fixed number of episodes.
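For context, here is a rough sketch (my own, not the code from that implementation) of that kind of episode-based collection. It assumes the older Gym API and a hypothetical `policy(obs)` that returns a `torch.distributions` object:

```python
import gym
import torch

# Assumes the older Gym API: reset() returns obs, step() returns 4 values.
env = gym.make("Pendulum-v0")

def collect_episodes(policy, n_episodes=10):
    """Roll out a fixed number of full episodes before each PPO update."""
    obs_buf, act_buf, logp_buf, rew_buf, done_buf = [], [], [], [], []
    for _ in range(n_episodes):
        obs, done = env.reset(), False
        while not done:
            dist = policy(torch.as_tensor(obs, dtype=torch.float32))
            action = dist.sample()
            next_obs, reward, done, _ = env.step(action.numpy())
            obs_buf.append(obs)
            act_buf.append(action)
            logp_buf.append(dist.log_prob(action).sum(-1))
            rew_buf.append(reward)
            done_buf.append(done)
            obs = next_obs
    return obs_buf, act_buf, logp_buf, rew_buf, done_buf
```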

The new version (not finished yet) will be more in line with what's done, for example, in OpenAI Baselines' ppo2 (a fixed number of steps per rollout, mini-batch updates).
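For reference, a minimal sketch of that ppo2-style update (my own sketch, not the poster's code): a fixed-length rollout is split into random mini-batches, and the clipped surrogate loss is optimised for a few epochs. All names here (`policy`, `value_fn`, `obs`, `actions`, `old_logp`, `advantages`, `returns`) are assumptions, not variables from the linked implementation:

```python
import torch

def ppo_update(policy, value_fn, optimizer, obs, actions, old_logp,
               advantages, returns, clip_eps=0.2, epochs=10, batch_size=64):
    """One PPO update over a fixed-length rollout, split into mini-batches."""
    n = obs.shape[0]
    for _ in range(epochs):
        for idx in torch.randperm(n).split(batch_size):
            dist = policy(obs[idx])
            logp = dist.log_prob(actions[idx]).sum(-1)
            # probability ratio between new and old policy
            ratio = (logp - old_logp[idx]).exp()
            surr1 = ratio * advantages[idx]
            surr2 = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages[idx]
            policy_loss = -torch.min(surr1, surr2).mean()
            value_loss = (value_fn(obs[idx]).squeeze(-1) - returns[idx]).pow(2).mean()
            loss = policy_loss + 0.5 * value_loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```

Note that continuous actions just change what `policy(obs)` returns (a distribution over real-valued actions, e.g. a Normal or Beta); the clipped surrogate itself is the same as in the discrete case.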

This implementation's hyper-parameters haven't really been tuned, so there's room for more alchemy.

I would recommend you get the master branch of PyTorch, which has a much richer distributions library, so you can avoid using a Gaussian policy and switch to something better like a Beta distribution, which has finite support (thus actual bounds on your actions).
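For illustration, a minimal sketch of what a Beta policy head could look like in PyTorch (my interpretation, not the poster's network): the network outputs two concentration parameters, the Beta sample lies in [0, 1], and it is rescaled to the environment's action bounds (e.g. [-2, 2] for Pendulum):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Beta

class BetaPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim, low, high, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.alpha_head = nn.Linear(hidden, act_dim)
        self.beta_head = nn.Linear(hidden, act_dim)
        self.low, self.high = low, high

    def forward(self, obs):
        h = self.body(obs)
        # softplus + 1 keeps both concentration parameters > 1 (unimodal Beta)
        alpha = F.softplus(self.alpha_head(h)) + 1.0
        beta = F.softplus(self.beta_head(h)) + 1.0
        return Beta(alpha, beta)

    def scale_action(self, sample):
        # map a Beta sample in [0, 1] onto the environment's action range
        return self.low + (self.high - self.low) * sample
```

In this kind of setup you would compute PPO's log-probability ratios on the raw [0, 1] sample and only apply `scale_action` when stepping the environment, so the distribution and its log-probs stay consistent.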


I have some examples here: https://github.com/Anjum48/rl-examples. I've recently improved the LSTM and distributed PPO implementations and will push some updates to GitHub soon.