Abstract
This paper contributes a detailed analysis of the architecture of Ha and Schmidhuber [5]. The original paper proposes an architecture comprising three main components: a “visual” module, a “memory” module, and a controller. As a whole, this architecture performed well in challenging domains. We investigate how each of these components contributes individually to the final performance of the system. Our results shed additional light on the role of the different components in the overall behavior of the agent, and illustrate how the different design options affect the behavior of the resulting agent.
This work was supported by national funds through Fundação para a Ciência e a Tecnologia (FCT) with ref. UIDB/50021/2020. BE acknowledges a research grant from Fundação Calouste Gulbenkian under the program “Novos Talentos em IA”.
References
Bishop, C.: Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University (1994)
Brockman, G., et al.: OpenAI Gym. CoRR abs/1606.01540 (2016)
Gaier, A., Ha, D.: Weight agnostic neural networks. Adv. Neural Inf. Process. Syst. 32, 5365–5378 (2019)
Graves, A.: Generating sequences with recurrent neural networks. CoRR abs/1308.0850 (2013)
Ha, D., Schmidhuber, J.: Recurrent world models facilitate policy evolution. Adv. Neural Inf. Process. Syst. 31, 2450–2462 (2018)
Hansen, N., Ostermeier, A.: Completely derandomized self-adaptation in evolution strategies. Evol. Comput. 9(2), 159–195 (2001)
Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. In: Proceedings of the 2nd International Conference on Learning Representations (2014)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. CoRR abs/1412.6980 (2014)
Lillicrap, T., et al.: Continuous control with deep reinforcement learning. In: Proceedings of the 4th International Conference on Learning Representations (2016)
Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015)
Risi, S., Stanley, K.: Deep neuroevolution of recurrent and discrete world models. In: Proceedings of the 2019 Genetic and Evolutionary Computation Conference, pp. 456–462 (2019)
Silver, D., et al.: A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science 362, 1140–1144 (2018)
Stulp, F., Sigaud, O.: Path integral policy improvement with covariance matrix adaptation. In: Proceedings of the 29th International Conference on Machine Learning, pp. 1547–1554 (2012)
Sutton, R., Barto, A.: Reinforcement Learning: An Introduction. MIT Press, Cambridge (2018)
Tallec, C., Blier, L., Kalainathan, D.: Reproducing “World Models”: Is training the recurrent network really needed? (2018). https://ctallec.github.io/world-models/
Tang, Y., Nguyen, D., Ha, D.: Neuroevolution of self-interpretable agents. CoRR abs/2003.08165 (2020)
Wang, T., et al.: Benchmarking model-based reinforcement learning. CoRR abs/1907.02057 (2019)
Appendices
A Full Comparative Results
For completeness, we provide in this appendix some additional results not reported in the main body of the paper.
A.1 Ablation Study Additional Results
We include, in Table 6, a comparison between the average scores obtained when using the full image and a cropped image. The results complement those portrayed in Table 2 and suggest essentially the same conclusions.
A.2 Improved Sample Policy
The VAE and MDN-RNN components are trained with a batch of game images sampled with a random driving policy. However, it is also important to understand how the policy used to sample this training batch affects the performance of the system as a whole. We thus considered training these components with a mix of images sampled with both a random policy and an expert policy. Results obtained with such a mixed sample are denoted as improved sample.
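For illustration, the snippet below sketches how such a mixed batch could be assembled. The episode-level 50/50 split between policies, the policy interfaces, and the classic OpenAI Gym API are assumptions made for the sake of the example, not the exact protocol used in our experiments.

```python
import numpy as np

def collect_mixed_frames(env, random_policy, expert_policy, n_frames, p_expert=0.5):
    """Gather environment frames using a mix of a random and an expert policy.

    Sketch only: the 50/50 episode-level split and the classic Gym
    (reset/step) interface are assumptions, not the paper's exact protocol.
    """
    frames = []
    while len(frames) < n_frames:
        # Choose which policy drives this episode.
        policy = expert_policy if np.random.rand() < p_expert else random_policy
        obs, done = env.reset(), False
        while not done and len(frames) < n_frames:
            obs, _, done, _ = env.step(policy(obs))
            frames.append(obs)
    return np.stack(frames)
```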
Table 7 shows the performance obtained when the visual component is trained with images from a mixed policy, in contrast with the performance obtained with images from a random policy. We can observe an improvement both in the cropped and the full image cases.
Table 8 reports the results obtained when we include a memory component. We report both the average score and the time step at which the agent attained a score of 900. With the better sampling method used to train the “visual” and “memory” components, the system as a whole tends to achieve better results (most notably when using cropped images) and to reach a high score faster.
B Model Details
In this appendix we provide the details of the architectures used in the different experiments. The overall structure of the neural networks closely follows that of the original paper of Ha and Schmidhuber [5], to which we refer the reader for further details.
B.1 VAE
We start by presenting the details of the VAE. The encoder and decoder are represented in Fig. 7. The encoder comprises 4 convolutional layers, each with a filter size of 4 and a stride of 2. The dimension of the latent space is \(N_z=32\). The decoder, in turn, comprises 4 deconvolution layers that reconstruct a \(64\times 64\times 3\) image \(\hat{\boldsymbol{x}}\) from a code \(\boldsymbol{z}\). The VAE is trained using gradient descent on a dataset of previously gathered environment frames.
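For concreteness, the following PyTorch sketch instantiates an encoder/decoder pair with these dimensions. The channel widths and the deconvolution kernel sizes are assumptions carried over from the original World Models implementation; only the number of layers, the encoder filter size and stride, and \(N_z=32\) are specified above.

```python
import torch
import torch.nn as nn

class ConvVAE(nn.Module):
    """VAE sketch: 4 conv / 4 deconv layers, latent dimension N_z = 32.

    Channel widths and deconv kernel sizes are assumptions borrowed from
    the original World Models implementation, not values stated here.
    """
    def __init__(self, n_z=32):
        super().__init__()
        self.encoder = nn.Sequential(                    # input: (B, 3, 64, 64)
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),    # -> (B, 32, 31, 31)
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),   # -> (B, 64, 14, 14)
            nn.Conv2d(64, 128, 4, stride=2), nn.ReLU(),  # -> (B, 128, 6, 6)
            nn.Conv2d(128, 256, 4, stride=2), nn.ReLU(), # -> (B, 256, 2, 2)
            nn.Flatten(),                                # -> (B, 1024)
        )
        self.fc_mu = nn.Linear(1024, n_z)
        self.fc_logvar = nn.Linear(1024, n_z)
        self.fc_dec = nn.Linear(n_z, 1024)
        self.decoder = nn.Sequential(                              # (B, 1024, 1, 1)
            nn.ConvTranspose2d(1024, 128, 5, stride=2), nn.ReLU(), # -> (B, 128, 5, 5)
            nn.ConvTranspose2d(128, 64, 5, stride=2), nn.ReLU(),   # -> (B, 64, 13, 13)
            nn.ConvTranspose2d(64, 32, 6, stride=2), nn.ReLU(),    # -> (B, 32, 30, 30)
            nn.ConvTranspose2d(32, 3, 6, stride=2), nn.Sigmoid(),  # -> (B, 3, 64, 64)
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization trick
        x_hat = self.decoder(self.fc_dec(z).view(-1, 1024, 1, 1))
        return x_hat, mu, logvar
```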
B.2 MDN-RNN
The memory module consists of a recurrent neural network (RNN) followed by a mixture density network (MDN), as represented in Fig. 4. The RNN is a single LSTM [7] layer with a hidden size of 256, and the MDN consists of a single linear fully-connected layer outputting 5 Gaussians per latent unit of \(\boldsymbol{z}\). The MDN-RNN is trained with gradient descent, using the previously learned VAE to encode the same previously gathered environment dataset and learning to sequentially predict the resulting latent sequences.
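The following sketch illustrates such a module. The action dimensionality (3, matching the CarRacing domain) and the exact parameterization of the mixture outputs are assumptions made for the sake of the example.

```python
import torch
import torch.nn as nn

class MDNRNN(nn.Module):
    """Memory module sketch: an LSTM with hidden size 256 followed by a single
    linear layer producing a 5-component Gaussian mixture per latent unit.
    The action dimension n_act = 3 is an assumption (CarRacing)."""
    def __init__(self, n_z=32, n_act=3, hidden=256, n_gauss=5):
        super().__init__()
        self.n_z, self.n_gauss = n_z, n_gauss
        self.lstm = nn.LSTM(n_z + n_act, hidden, batch_first=True)
        # Per latent unit: mixture logits, means, and log standard deviations.
        self.mdn = nn.Linear(hidden, 3 * n_gauss * n_z)

    def forward(self, z, a, hidden=None):
        # z: (B, T, n_z), a: (B, T, n_act)
        out, hidden = self.lstm(torch.cat([z, a], dim=-1), hidden)
        params = self.mdn(out).view(out.size(0), out.size(1), self.n_z, 3 * self.n_gauss)
        logpi, mu, logsigma = params.chunk(3, dim=-1)  # each (B, T, n_z, n_gauss)
        return logpi.log_softmax(dim=-1), mu, logsigma, hidden
```

Training minimizes the negative log-likelihood of the next latent code \(\boldsymbol{z}_{t+1}\) under the predicted mixture.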
B.3 Controller
The original controller presented by Ha and Schmidhuber [5] consists of a single fully-connected layer with a custom, environment-specific activation function, defined in Table 9. In this work, we also analyzed different architectural choices, such as removing the squashing activation function and adding a hidden layer with 24 units and a tanh activation function. The controller is trained with CMA-ES [6] by interacting with the environment.
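A sketch of the original linear controller follows. The tanh squashing here is a stand-in for the custom activation of Table 9, and the action dimension is an assumption.

```python
import numpy as np

def controller(params, z, h, n_act=3):
    """Single fully-connected layer mapping [z, h] to an action vector.

    Sketch only: tanh stands in for the custom, environment-specific
    activation of Table 9, and n_act = 3 is an assumption (CarRacing).
    """
    x = np.concatenate([z, h])                       # latent code + LSTM hidden state
    n_in = x.size
    W = params[: n_act * n_in].reshape(n_act, n_in)  # weights, flattened for CMA-ES
    b = params[n_act * n_in :]                       # biases
    return np.tanh(W @ x + b)
```

The flat vector params is precisely the genome evolved by CMA-ES: each candidate vector is evaluated by rolling out the corresponding controller in the environment and returning its score.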
B.4 Hyperparameters
See Table 10 for the hyperparameters used in our experiments.