I’m trying to implement my own PPO algorithm, and I have a question about the probability ratio that is used in the clipped loss.
Here’s the setup:
- The agent first meets the environment and generates a sample of trajectories. From these, we record the states, the actions, and the log-probabilities of the chosen actions.
- The first update isn’t very relevant since the ratio has nothing to compare against (right?). However, we save the log-probs as the old policy’s log-probs.
- We then generate a second sample of trajectories.
- Now, for the update, we want to compare the old policy with the new one. But how? Do we use the old states (from sample one) to compute log-probabilities under the new policy and then form the ratio? That would mean we also need to record both the states visited and the actions taken in sample one. Or do we use the new data from sample two?
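To make the first option concrete, here is a minimal sketch of what I mean. The toy softmax policy, the variable names, and the shapes are all hypothetical (just NumPy, no real agent); the point is only that the ratio is computed by re-evaluating the *recorded* states and actions under the new weights:

```python
import numpy as np

def log_probs(weights, states, actions):
    # Hypothetical linear-softmax policy: logits = states @ weights.
    logits = states @ weights
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Pick out the log-probability of each action actually taken.
    return log_p[np.arange(len(actions)), actions]

rng = np.random.default_rng(0)
states = rng.normal(size=(5, 3))       # states recorded during sampling
actions = rng.integers(0, 2, size=5)   # actions recorded during sampling

w_old = rng.normal(size=(3, 2))
old_log_probs = log_probs(w_old, states, actions)  # saved at sampling time

w_new = w_old + 0.1 * rng.normal(size=(3, 2))      # policy after a gradient step
new_log_probs = log_probs(w_new, states, actions)  # SAME states/actions, new weights

ratio = np.exp(new_log_probs - old_log_probs)

advantages = rng.normal(size=5)  # placeholder advantage estimates
eps = 0.2
surrogate = np.minimum(ratio * advantages,
                       np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages)
```

Note that if the policy hasn’t changed yet (`w_new == w_old`), the ratio is exactly 1 everywhere, which is why the very first update has nothing meaningful to compare.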
Thanks for your ideas!