
Revisiting “Recurrent World Models Facilitate Policy Evolution”

  • Conference paper
Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12981)

Abstract

This paper contributes a detailed analysis of the architecture of Ha and Schmidhuber [5]. The original paper proposes an architecture comprising three main components: a “visual” module, a “memory” module, and a controller. As a whole, this architecture performed well in challenging domains. We investigate how each of these components contributes individually to the final performance of the system. Our results shed additional light on the role of the different components in the overall behavior of the agent, and illustrate how different design options affect the behavior of the resulting agent.

This work was supported by national funds through Fundação para a Ciência e a Tecnologia (FCT) with ref. UIDB/50021/2020. BE acknowledges a research grant from Fundação Calouste Gulbenkian under program “Novos Talentos em IA.”


References

  1. Bishop, C.: Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, February 1994


  2. Brockman, G., et al.: OpenAI Gym. CoRR abs/1606.01540 (2016)


  3. Gaier, A., Ha, D.: Weight agnostic neural networks. Adv. Neural Inf. Process. Syst. 32, 5365–5378 (2019)


  4. Graves, A.: Generating sequences with recurrent neural networks. CoRR abs/1308.0850 (2013)


  5. Ha, D., Schmidhuber, J.: Recurrent world models facilitate policy evolution. Adv. Neural Inf. Process. Syst. 31, 2450–2462 (2018)


  6. Hansen, N., Ostermeier, A.: Completely derandomized self-adaptation in evolution strategies. Evol. Comput. 9(2), 159–195 (2001)


  7. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)


  8. Kingma, D., Welling, M.: Auto-encoding variational Bayes. In: Proceedings of the 2nd International Conference on Learning Representations (2014)


  9. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. CoRR abs/1412.6980 (2014)


  10. Lillicrap, T., et al.: Continuous control with deep reinforcement learning. In: Proceedings of the 4th International Conference on Learning Representations (2016)


  11. Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015)


  12. Risi, S., Stanley, K.: Deep neuroevolution of recurrent and discrete world models. In: Proceedings of the 2019 Genetic and Evolutionary Computation Conference, pp. 456–462 (2019)


  13. Silver, D., et al.: A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science 362, 1140–1144 (2018)


  14. Stulp, F., Sigaud, O.: Path integral policy improvement with covariance matrix adaptation. In: Proceedings of the 29th International Conference on Machine Learning, pp. 1547–1554 (2012)


  15. Sutton, R., Barto, A.: Reinforcement Learning: An Introduction. MIT Press, Cambridge (2018)


  16. Tallec, C., Blier, L., Kalainathan, D.: Reproducing “World Models”: Is training the recurrent network really needed? (2018). https://ctallec.github.io/world-models/

  17. Tang, Y., Nguyen, D., Ha, D.: Neuroevolution of self-interpretable agents. CoRR abs/2003.08165 (2020)


  18. Wang, T., et al.: Benchmarking model-based reinforcement learning. CoRR abs/1907.02057 (2019)



Author information

Corresponding author: Francisco S. Melo

Appendices

A Full Comparative Results

For completeness, we provide in this appendix some additional results not reported in the main body of the paper.

A.1 Ablation Study: Additional Results

We include, in Table 6, a comparison between the average score obtained when using the full image and a cropped image. The results complement those reported in Table 2 and suggest essentially the same conclusions.

Table 6. Average score on both the game’s cropped and full image
Table 7. Average score obtained with \(V_\mathrm{mean}\) trained with images from different sampling methods on the game’s full image.
Table 8. Average score and steps to reach 900-score, training the VAE and MDN-RNN components with images obtained using different sampling methods.
Fig. 7. VAE architectural details from the original World Models implementation [5]

A.2 Improved Sample Policy

The VAE and MDN-RNN components are trained with a batch of game images sampled with a random driving policy. However, it is also important to understand how the policy used to collect this training data affects the performance of the system as a whole. We thus considered training these components with a mix of images sampled with both a random policy and an expert policy. Results obtained with such a mixed sample are denoted as improved sample.
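For illustration, the sketch below shows one way such a mixed dataset could be collected. The function name, the 50/50 split, and the `expert_policy` callable are assumptions for illustration only; the exact mix and expert policy used in the experiments are not reproduced here. The classic OpenAI Gym interface [2] is assumed.

```python
import numpy as np


def collect_mixed_frames(env, expert_policy, n_rollouts=100,
                         expert_fraction=0.5, max_steps=1000):
    """Collect frames with a mix of random and expert rollouts.

    `expert_policy(obs) -> action` is assumed to be given; the fraction of
    expert rollouts is a placeholder. Uses the classic Gym interface
    (reset() -> obs, step() -> obs, reward, done, info).
    """
    frames = []
    for i in range(n_rollouts):
        use_expert = i < int(expert_fraction * n_rollouts)
        obs = env.reset()
        for _ in range(max_steps):
            frames.append(obs)
            action = expert_policy(obs) if use_expert else env.action_space.sample()
            obs, _, done, _ = env.step(action)
            if done:
                break
    return np.asarray(frames)
```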

Table 7 shows the performance obtained when the visual component is trained with images from a mixed policy, in contrast with the performance obtained with images from a random policy. We can observe an improvement both in the cropped and the full image cases.

Table 8 reports the results obtained when we include the memory component. We report both the average score and the time step at which the agent attained a score of 900. When a better sampling method is used to train the “visual” and “memory” components, the system as a whole tends to achieve better results (mostly when using cropped images) and to reach a high score faster.

B Model Details

In this section we provide the details of the architectures used in the different experiments. The overall structure of the neural networks closely follows that of the original paper of Ha and Schmidhuber [5], and we refer to that work for further details.

B.1 VAE

We start by presenting the details of the VAE. The encoder and decoder are represented in Fig. 7. The encoder comprises 4 convolutional layers, each with filter size 4 and stride 2. The dimension of the latent space is \(N_z=32\). The decoder, in turn, comprises 4 deconvolution layers that reconstruct a \(64\times 64\times 3\) image \(\hat{\boldsymbol{x}}\) from a code \(\boldsymbol{z}\). The VAE is trained using gradient descent on a dataset of previously gathered environment frames.
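A minimal PyTorch sketch of a VAE with this structure follows. The filter size, stride, latent dimension, and output size come from the text above; the channel widths and decoder kernel sizes follow the publicly available World Models implementation and should be read as assumptions, not a reproduction of the authors' original code.

```python
import torch
import torch.nn as nn


class ConvVAE(nn.Module):
    """Sketch of the World Models VAE: a 4-layer convolutional encoder
    (filter size 4, stride 2) maps a 64x64x3 frame to a 32-dim latent code;
    a 4-layer deconvolutional decoder reconstructs the frame."""

    def __init__(self, n_z=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2), nn.ReLU(),
            nn.Conv2d(128, 256, 4, stride=2), nn.ReLU(),   # -> 256 x 2 x 2
        )
        self.fc_mu = nn.Linear(1024, n_z)
        self.fc_logvar = nn.Linear(1024, n_z)
        self.fc_dec = nn.Linear(n_z, 1024)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(1024, 128, 5, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 5, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 6, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 6, stride=2), nn.Sigmoid(),  # -> 3 x 64 x 64
        )

    def forward(self, x):
        h = self.encoder(x).flatten(start_dim=1)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization
        x_hat = self.decoder(self.fc_dec(z).view(-1, 1024, 1, 1))
        return x_hat, mu, logvar


def vae_loss(x, x_hat, mu, logvar):
    """Reconstruction (L2) term plus KL divergence to the standard normal prior."""
    recon = ((x - x_hat) ** 2).sum(dim=(1, 2, 3)).mean()
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1).mean()
    return recon + kl
```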

B.2 MDN-RNN

The memory module consists of a recurrent neural network (RNN) followed by a mixture density network (MDN), as represented in Fig. 4. The RNN is a single LSTM [7] layer with a hidden size of 256, and the MDN consists of a single fully-connected linear layer with 5 Gaussians per latent unit of \(\boldsymbol{z}\). The MDN-RNN is trained with gradient descent, using the previously learned VAE and learning to sequentially reconstruct the encoded sequences from the same previously gathered environment dataset.
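For concreteness, below is a minimal PyTorch sketch of such an MDN-RNN. The action dimension (3, for CarRacing) and the use of per-dimension diagonal Gaussian mixtures follow the publicly available World Models implementation and are assumptions rather than the authors' code.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class MDNRNN(nn.Module):
    """Sketch of the memory module: a single 256-unit LSTM followed by a
    linear MDN head with 5 Gaussians per latent dimension of z."""

    def __init__(self, n_z=32, n_actions=3, n_hidden=256, n_gaussians=5):
        super().__init__()
        self.n_z, self.n_gaussians = n_z, n_gaussians
        self.lstm = nn.LSTM(n_z + n_actions, n_hidden, batch_first=True)
        # For each latent dimension: mixture logits, means and log-std-devs.
        self.mdn = nn.Linear(n_hidden, 3 * n_gaussians * n_z)

    def forward(self, z, a, hidden=None):
        out, hidden = self.lstm(torch.cat([z, a], dim=-1), hidden)
        params = self.mdn(out).view(out.size(0), out.size(1),
                                    self.n_z, 3 * self.n_gaussians)
        logit_pi, mu, log_sigma = params.chunk(3, dim=-1)
        return F.log_softmax(logit_pi, dim=-1), mu, log_sigma, hidden


def mdn_loss(log_pi, mu, log_sigma, z_next):
    """Negative log-likelihood of the next latent code under the mixture."""
    z = z_next.unsqueeze(-1)  # broadcast over the mixture components
    log_prob = (-0.5 * ((z - mu) / log_sigma.exp()) ** 2
                - log_sigma - 0.5 * math.log(2 * math.pi))
    return -torch.logsumexp(log_pi + log_prob, dim=-1).mean()
```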

B.3 Controller

The original controller presented by Ha and Schmidhuber [5] consists of a single fully-connected layer with a custom, environment-specific activation function defined in Table 9. In this work, we also analyzed different architectural choices, such as removing the squashing activation function and adding a hidden layer with 24 units and a Tanh activation function. The controller is trained with CMA-ES [6] by interacting with the environment.

Table 9. Custom activation function from the implementation of Ha and Schmidhuber [5]
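As an illustration, the following sketch shows a linear controller of this form and a CMA-ES training loop using the `cma` package [6]. The input and output dimensions (32-dim latent plus 256-dim hidden state, 3 CarRacing actions) and the particular squashing in the comments follow the publicly available World Models implementation, not necessarily Table 9; the `evaluate` function (average return over rollouts) is assumed to be provided.

```python
import numpy as np
import cma  # pycma implementation of CMA-ES [6]

N_Z, N_HIDDEN, N_ACTIONS = 32, 256, 3  # latent code, LSTM hidden state, CarRacing actions
N_PARAMS = N_ACTIONS * (N_Z + N_HIDDEN) + N_ACTIONS  # weights + biases


def controller(params, z, h):
    """Single fully-connected layer mapping [z, h] to the 3 actions."""
    W = params[:N_ACTIONS * (N_Z + N_HIDDEN)].reshape(N_ACTIONS, N_Z + N_HIDDEN)
    b = params[-N_ACTIONS:]
    a = W @ np.concatenate([z, h]) + b
    # Environment-specific squashing in the spirit of Table 9:
    # steering in [-1, 1], gas and brake in [0, 1].
    return np.array([np.tanh(a[0]),
                     (np.tanh(a[1]) + 1.0) / 2.0,
                     np.clip(np.tanh(a[2]), 0.0, 1.0)])


def evolve(evaluate, sigma0=0.1, n_generations=100):
    """evaluate(params) -> average episode return; CMA-ES maximizes it."""
    es = cma.CMAEvolutionStrategy(N_PARAMS * [0.0], sigma0)
    for _ in range(n_generations):
        candidates = es.ask()
        es.tell(candidates, [-evaluate(p) for p in candidates])  # cma minimizes
    return es.result.xbest
```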

B.4 Hyperparameters

(See Table 10)

Table 10. Hyperparameters used for training the World Models agent


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Esteves, B., Melo, F.S. (2021). Revisiting “Recurrent World Models Facilitate Policy Evolution”. In: Marreiros, G., Melo, F.S., Lau, N., Lopes Cardoso, H., Reis, L.P. (eds) Progress in Artificial Intelligence. EPIA 2021. Lecture Notes in Computer Science, vol 12981. Springer, Cham. https://doi.org/10.1007/978-3-030-86230-5_26


  • DOI: https://doi.org/10.1007/978-3-030-86230-5_26

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-86229-9

  • Online ISBN: 978-3-030-86230-5
