Question about action selection in Evolution Strategies


This question is about the OpenAI paper titled “Evolution strategies as a scalable alternative…”.

I just wanted to know how they sample actions in a discrete action space such as the Atari environments. Are actions sampled from a softmax distribution, as in policy gradient methods, or chosen greedily?



In the paper, they explain that the noise is injected into the function parameters, i.e. the weights, so each individual in the population is slightly different. To measure the performance of a given weight configuration consistently, the actions are chosen greedily (the argmax of the policy network's output). Exploration comes from the parameter perturbations themselves, not from stochastic action selection.
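
To make the contrast concrete, here is a minimal sketch with a toy linear policy (everything here, including `policy_logits` and the array shapes, is illustrative and not from the paper): a policy gradient method would sample from the softmax over the logits, while ES evaluates each perturbed parameter vector with a deterministic argmax.

```python
import numpy as np

rng = np.random.default_rng(0)

def policy_logits(theta, obs):
    # Toy linear policy: one logit per discrete action
    return obs @ theta

def act_greedy(logits):
    # ES-style selection: deterministic argmax over the logits
    return int(np.argmax(logits))

def act_softmax(logits):
    # Policy-gradient-style selection: sample from the softmax distribution
    z = logits - logits.max()            # subtract max for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(p), p=p))

# One ES "individual": the noise goes into the weights, while action
# selection stays deterministic, so the fitness of this particular
# weight configuration is measured consistently across a rollout.
theta = rng.standard_normal((4, 3))      # hypothetical: 4-dim obs, 3 actions
sigma = 0.1                              # perturbation scale
epsilon = rng.standard_normal(theta.shape)
theta_perturbed = theta + sigma * epsilon

obs = rng.standard_normal(4)
print(act_greedy(policy_logits(theta_perturbed, obs)))   # ES-style action
print(act_softmax(policy_logits(theta, obs)))            # PG-style action
```

The design point is that if actions were also sampled stochastically, the return of a single rollout would confound two noise sources (weight noise and action noise), making the fitness signal for each perturbation noisier.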