Related to: [Improved Q-learning with continuous actions]

I have been working on the request for research “Improved Q-learning with continuous actions” as a personal project. Since I will be starting my master's degree soon, I will likely discontinue work on it. I would therefore like to share my progress in the hope that someone finds it useful or perhaps even continues the project.

All code is available on GitHub.

I will assume that the reader is familiar with OpenAI Gym MuJoCo environments and the Continuous Deep Q-Learning with Model-based Acceleration paper.

The project's focus was to solve various OpenAI Gym environments with continuous action spaces using a fixed algorithm with fixed hyperparameters. I implemented the “normalized advantage function” (NAF) algorithm from the paper mentioned above and created a code environment for running evaluation experiments on OpenAI Gym environments. I experimented with different hyperparameterisations and with different covariance matrices for the advantage-function term: the covariance used in the original paper, an identity covariance and a diagonal covariance (I have also added an untested “square covariance”).
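To make the covariance variants concrete, here is a small NumPy sketch (illustrative names, not taken from the repository). The “original” parameterization follows the paper's P(s) = L(s) L(s)ᵀ construction, where L is lower-triangular with an exponentiated diagonal and is built from raw network outputs:

```python
import numpy as np

def advantage(a, mu, P):
    """NAF quadratic advantage: A(s, a) = -1/2 (a - mu)^T P (a - mu)."""
    d = a - mu
    return -0.5 * d @ P @ d

def precision_original(l_entries, dim):
    """'Original' covariance from the paper: P = L L^T, with L
    lower-triangular and an exponentiated diagonal. `l_entries` stands
    in for the flat vector of raw network outputs."""
    L = np.zeros((dim, dim))
    L[np.tril_indices(dim)] = l_entries
    L[np.diag_indices(dim)] = np.exp(np.diag(L))
    return L @ L.T

def precision_identity(dim):
    """Identity covariance: no extra parameters to learn."""
    return np.eye(dim)

def precision_diagonal(d_entries):
    """Diagonal covariance: one positive entry per action dimension."""
    return np.diag(np.exp(np.asarray(d_entries)))
```

Whatever the parameterization, the advantage is maximized (at zero) when the action equals the policy mean mu, which is what lets NAF read off the greedy action in closed form.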

Unfortunately, I was unable to perform important experiments due to limited computational resources (namely evaluating fixed-algorithm/fixed-hyperparameter performance across several environments with several thousand episodes per environment). Furthermore, I used an old version of MuJoCo (1.31) because I no longer have access to my previous university email and thus my student licence.

My primary focus was on the following MuJoCo environments: InvertedPendulum, InvertedDoublePendulum, Pendulum, Reacher and Humanoid.

My main observations were the following:

- The algorithm easily solved simple environments such as InvertedPendulum, InvertedDoublePendulum and Pendulum. I was unable to solve the Reacher and Humanoid environments. However, this was likely due to my limited computational resources: the one time I let an experiment on the Humanoid environment run for several thousand episodes (~6 hours on an i3-6100U 2.3 GHz CPU), performance was still increasing over time.
- I found it advantageous to use normalized actions within the NAF algorithm (not to be confused with the meaning of “normalized” in the algorithm's name) and to rescale them to the action space domain when interacting with the environment.
- For simple environments such as InvertedPendulum, InvertedDoublePendulum and Pendulum, I observed better performance using an identity covariance matrix. For more complex environments such as Humanoid and Reacher, the original covariance gave better performance. The diagonal covariance performed poorly relative to both. As the original covariance is more expressive but requires learning more parameters, these observations make sense: for simple environments, where the state-to-action function can be accurately approximated by the NAF algorithm with a simple advantage term, the identity covariance solves the environment faster because fewer parameters have to be learned. For more complex environments, a simple advantage term with identity covariance is not expressive enough to approximate the state-to-action function.
- The algorithm was highly sensitive to weight initialization and very unstable without high initial exploration noise.
- For balancing environments (environments where the task involved balancing an object, such as InvertedPendulum, InvertedDoublePendulum and Pendulum), adding random (Gaussian) noise to actions generally gave better performance than randomly sampling actions with a certain probability. The opposite was true for non-balancing environments such as Humanoid. However, adding random noise instead of random sampling decreased stability in the balancing environments.
- Differences in reward magnitudes between environments were a challenge when using the original covariance. Unless the learning rate or batch size was decreased for environments with large rewards (e.g. Humanoid relative to InvertedPendulum), the weights associated with the neural network outputs feeding the covariance matrix were likely to explode. A hacky workaround was to apply a sigmoid or tanh activation function to these outputs. A better solution (which I did not explore) could involve normalizing the rewards to increase robustness across environments.
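The action rescaling mentioned in the second point can be sketched like this (the function name and the clipping are illustrative, not taken from the repository): the algorithm works with actions in [-1, 1] and they are mapped onto the environment's `[low, high]` box just before `env.step()`.

```python
import numpy as np

def rescale_action(a, low, high):
    """Map a normalized action in [-1, 1] onto the environment's
    action space [low, high] before calling env.step()."""
    a = np.clip(a, -1.0, 1.0)
    return low + 0.5 * (a + 1.0) * (high - low)
```

With a Gym environment, `low` and `high` would come from `env.action_space.low` and `env.action_space.high`.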
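The two exploration schemes compared in the balancing-environment point can be sketched as follows (names and signatures are illustrative):

```python
import numpy as np

def gaussian_noise_action(a, sigma, rng):
    """Additive Gaussian noise: keeps the action near the policy's
    choice, which suited the balancing environments."""
    return a + rng.normal(0.0, sigma, size=a.shape)

def epsilon_random_action(a, epsilon, low, high, rng):
    """Epsilon-random exploration: with probability epsilon, replace
    the action by a uniform sample; this worked better on
    non-balancing environments such as Humanoid."""
    if rng.random() < epsilon:
        return rng.uniform(low, high, size=a.shape)
    return a
```

The intuition is that additive noise perturbs the policy's action slightly (good when a balanced pose must be maintained), while uniform resampling occasionally takes a completely unrelated action (good for broader exploration).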
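As for the unexplored reward-normalization idea in the last point, one way it could look (a sketch only; I did not implement this) is a running mean/std normalizer over observed rewards, here using Welford's online algorithm:

```python
class RewardNormalizer:
    """Running mean/std reward normalization (Welford's online
    algorithm). A sketch of the unexplored idea, not project code."""

    def __init__(self, eps=1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0   # sum of squared deviations from the mean
        self.eps = eps  # avoids division by zero early on

    def update(self, r):
        """Fold one observed reward into the running statistics."""
        self.count += 1
        delta = r - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (r - self.mean)

    def normalize(self, r):
        """Return the reward standardized by the running statistics."""
        std = (self.m2 / max(self.count, 1)) ** 0.5
        return (r - self.mean) / (std + self.eps)
```

Feeding the network normalized rewards would keep the gradient magnitudes on the covariance-related outputs comparable across environments, removing the need to tune the learning rate or batch size per environment.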

If you have any questions, please do not hesitate to contact me. I would be very happy to see someone continue the project or otherwise use my code or experience. I will also gladly explain my code (including how to use it), the relevant parts of the original paper, the neural network architecture used, or my intuition for how the NAF algorithm works (if several people request this, I will explain it here).