Question about action selection


#1

This question is about the paper titled “Evolution strategies as a scalable alternative…” from OpenAI.

I'd like to know how they select actions in a discrete action space such as the Atari environments. Are actions sampled from a softmax distribution, as is done in policy gradient methods, or chosen greedily?


#2

Hello,

In the paper, they explain that the noise is injected into the function parameters, i.e. the network weights, so each individual in the population is slightly different. Hence, in order to measure the performance of each parameter configuration consistently, the actions are chosen greedily (the argmax of the network's output).
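
To make the distinction concrete, here is a minimal sketch of one ES generation with greedy action selection. The toy linear policy, the deterministic toy task, and all hyperparameter values are illustrative assumptions, not taken from the paper; the point is only that exploration comes from perturbing the weights, while each perturbed individual acts deterministically via argmax.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear policy standing in for the paper's network (illustrative only):
# logits = obs @ W, action = argmax of the logits.
obs_dim, n_actions = 4, 3
theta = np.zeros(obs_dim * n_actions)  # flat parameter vector

def act(theta, obs):
    W = theta.reshape(obs_dim, n_actions)
    return int(np.argmax(obs @ W))  # greedy: no softmax sampling

def episode_return(theta):
    # Hypothetical deterministic toy task: reward 1 whenever the greedy
    # action matches the argmax of the observation, over a few fixed steps.
    total = 0.0
    for t in range(5):
        obs = np.sin(np.arange(obs_dim) + t)  # fixed, deterministic obs
        total += float(act(theta, obs) == int(np.argmax(obs[:n_actions])))
    return total

# One ES generation per loop iteration: perturb the weights, evaluate each
# perturbed individual with greedy actions, then move theta along the
# return-weighted noise directions.
sigma, alpha, n_pop = 0.1, 0.05, 50
for gen in range(100):
    eps = rng.standard_normal((n_pop, theta.size))
    returns = np.array([episode_return(theta + sigma * e) for e in eps])
    std = returns.std()
    if std > 0:
        advantages = (returns - returns.mean()) / std
        theta += alpha / (n_pop * sigma) * eps.T @ advantages

print(episode_return(theta))
```

Because each individual's behavior is a deterministic function of its perturbed weights, its measured return reflects that weight configuration alone, which is what the greedy choice buys you here.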