
Revisiting “Recurrent World Models Facilitate Policy Evolution”

  • Conference paper
Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12981)

Abstract

This paper contributes a detailed analysis of the architecture of Ha and Schmidhuber [5]. The original paper proposes an architecture comprising three main components: a “visual” module, a “memory” module, and a controller. As a whole, this architecture performed well in challenging domains. We investigate how each of these components contributes individually to the final performance of the system. Our results shed additional light on the role of the different components in the overall behavior of the agent, and illustrate how different design options affect the behavior of the resulting agent.

This work was supported by national funds through Fundação para a Ciência e a Tecnologia (FCT) with ref. UIDB/50021/2020. BE acknowledges a research grant from Fundação Calouste Gulbenkian under program “Novos Talentos em IA.”


References

  1. Bishop, C.: Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, February 1994


  2. Brockman, G., et al.: OpenAI Gym. CoRR abs/1606.01540 (2016)


  3. Gaier, A., Ha, D.: Weight agnostic neural networks. Adv. Neural Inf. Process. Syst. 32, 5365–5378 (2019)


  4. Graves, A.: Generating sequences with recurrent neural networks. CoRR abs/1308.0850 (2013)


  5. Ha, D., Schmidhuber, J.: Recurrent world models facilitate policy evolution. Adv. Neural Inf. Process. Syst. 31, 2450–2462 (2018)


  6. Hansen, N., Ostermeier, A.: Completely derandomized self-adaptation in evolution strategies. Evol. Comput. 9(2), 159–195 (2001)


  7. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)


  8. Kingma, D., Welling, M.: Auto-encoding variational Bayes. In: Proceedings of the 2nd International Conference on Learning Representations (2014)


  9. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. CoRR abs/1412.6980 (2014)


  10. Lillicrap, T., et al.: Continuous control with deep reinforcement learning. In: Proceedings of the 4th International Conference on Learning Representations (2016)


  11. Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015)


  12. Risi, S., Stanley, K.: Deep neuroevolution of recurrent and discrete world models. In: Proceedings of the 2019 Genetic and Evolutionary Computation Conference, pp. 456–462 (2019)


  13. Silver, D., et al.: A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science 362, 1140–1144 (2018)


  14. Stulp, F., Sigaud, O.: Path integral policy improvement with covariance matrix adaptation. In: Proceedings of the 29th International Conference on Machine Learning, pp. 1547–1554 (2012)


  15. Sutton, R., Barto, A.: Reinforcement Learning: An Introduction. MIT Press, Cambridge (2018)


  16. Tallec, C., Blier, L., Kalainathan, D.: Reproducing “World Models”: Is training the recurrent network really needed? (2018). https://ctallec.github.io/world-models/

  17. Tang, Y., Nguyen, D., Ha, D.: Neuroevolution of self-interpretable agents. CoRR abs/2003.08165 (2020)


  18. Wang, T., et al.: Benchmarking model-based reinforcement learning. CoRR abs/1907.02057 (2019)



Author information

Corresponding author: Francisco S. Melo

Appendices

A Full Comparative Results

For completeness, we provide in this appendix some additional results not reported in the main body of the paper.

A.1 Ablation Study: Additional Results

We include, in Table 6, a comparison between the average score obtained when using the full image and a cropped image. The results complement those reported in Table 2 and suggest essentially the same conclusions.

Table 6. Average score on both the game’s cropped and full image
Table 7. Average score obtained with \(V_\mathrm{mean}\) trained with images from different sampling methods on the game’s full image.
Table 8. Average score and steps to reach 900-score, training the VAE and MDN-RNN components with images obtained using different sampling methods.
Fig. 7. VAE architectural details from the original World Models implementation [5]

A.2 Improved Sample Policy

The VAE and MDN-RNN components are trained with a batch of game images sampled with a random driving policy. However, it is also important to understand how the policy used to collect this training data affects the performance of the system as a whole. We thus considered training these components with a mix of images sampled with both a random policy and an expert policy. Results obtained with such a mixed sample are denoted as improved sample.
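For illustration, the sketch below shows one way such a mixed dataset could be collected. The function name, the 50/50 split, and the `expert_policy` callable are assumptions for illustration only; the exact mix and expert policy used in the experiments are not reproduced here. The classic OpenAI Gym interface [2] is assumed.

```python
import numpy as np


def collect_mixed_frames(env, expert_policy, n_rollouts=100,
                         expert_fraction=0.5, max_steps=1000):
    """Collect frames with a mix of random and expert rollouts.

    `expert_policy(obs) -> action` is assumed to be given; the fraction of
    expert rollouts is a placeholder. Uses the classic Gym interface
    (reset() -> obs, step() -> obs, reward, done, info).
    """
    frames = []
    for i in range(n_rollouts):
        use_expert = i < int(expert_fraction * n_rollouts)
        obs = env.reset()
        for _ in range(max_steps):
            frames.append(obs)
            action = expert_policy(obs) if use_expert else env.action_space.sample()
            obs, _, done, _ = env.step(action)
            if done:
                break
    return np.asarray(frames)
```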

Table 7 shows the performance obtained when the visual component is trained with images from a mixed policy, in contrast with the performance obtained with images from a random policy. We can observe an improvement both in the cropped and the full image cases.

Table 8 reports the results obtained when we include the memory component. We report both the average score and the time step at which the agent attained a score of 900. When a better sampling method is used to train the “visual” and “memory” components, the system as a whole tends to achieve better results (mostly when using cropped images) and to reach a high score faster.

B Model Details

In this section we provide the details of the architectures used in the different experiments. The overall structure of the neural networks closely follows that of the original paper of Ha and Schmidhuber [5], and we refer to that work for further details.

B.1 VAE

We start by presenting the details of the VAE. The encoder and decoder are represented in Fig. 7. The encoder comprises 4 convolutional layers, each with filter size 4 and stride 2. The dimension of the latent space is \(N_z=32\). The decoder, in turn, comprises 4 deconvolution layers that reconstruct a \(64\times 64\times 3\) image \(\hat{\boldsymbol{x}}\) from a code \(\boldsymbol{z}\). The VAE is trained using gradient descent on a dataset of previously gathered environment frames.
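A minimal PyTorch sketch of a VAE with this structure follows. The filter size, stride, latent dimension, and output size come from the text above; the channel widths and decoder kernel sizes follow the publicly available World Models implementation and should be read as assumptions, not a reproduction of the authors' original code.

```python
import torch
import torch.nn as nn


class ConvVAE(nn.Module):
    """Sketch of the World Models VAE: a 4-layer convolutional encoder
    (filter size 4, stride 2) maps a 64x64x3 frame to a 32-dim latent code;
    a 4-layer deconvolutional decoder reconstructs the frame."""

    def __init__(self, n_z=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2), nn.ReLU(),
            nn.Conv2d(128, 256, 4, stride=2), nn.ReLU(),   # -> 256 x 2 x 2
        )
        self.fc_mu = nn.Linear(1024, n_z)
        self.fc_logvar = nn.Linear(1024, n_z)
        self.fc_dec = nn.Linear(n_z, 1024)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(1024, 128, 5, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 5, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 6, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 6, stride=2), nn.Sigmoid(),  # -> 3 x 64 x 64
        )

    def forward(self, x):
        h = self.encoder(x).flatten(start_dim=1)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization
        x_hat = self.decoder(self.fc_dec(z).view(-1, 1024, 1, 1))
        return x_hat, mu, logvar


def vae_loss(x, x_hat, mu, logvar):
    """Reconstruction (L2) term plus KL divergence to the standard normal prior."""
    recon = ((x - x_hat) ** 2).sum(dim=(1, 2, 3)).mean()
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1).mean()
    return recon + kl
```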

B.2 MDN-RNN

The memory module consists of a recurrent neural network (RNN) followed by a mixture density network (MDN), as represented in Fig. 4. The RNN is a single LSTM [7] layer with a hidden size of 256, and the MDN consists of a single fully-connected linear layer with 5 Gaussians per latent unit of \(\boldsymbol{z}\). The MDN-RNN is trained with gradient descent, using the previously learned VAE and learning to sequentially reconstruct the encoded sequences from the same previously gathered environment dataset.
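For concreteness, below is a minimal PyTorch sketch of such an MDN-RNN. The action dimension (3, for CarRacing) and the use of per-dimension diagonal Gaussian mixtures follow the publicly available World Models implementation and are assumptions rather than the authors' code.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class MDNRNN(nn.Module):
    """Sketch of the memory module: a single 256-unit LSTM followed by a
    linear MDN head with 5 Gaussians per latent dimension of z."""

    def __init__(self, n_z=32, n_actions=3, n_hidden=256, n_gaussians=5):
        super().__init__()
        self.n_z, self.n_gaussians = n_z, n_gaussians
        self.lstm = nn.LSTM(n_z + n_actions, n_hidden, batch_first=True)
        # For each latent dimension: mixture logits, means and log-std-devs.
        self.mdn = nn.Linear(n_hidden, 3 * n_gaussians * n_z)

    def forward(self, z, a, hidden=None):
        out, hidden = self.lstm(torch.cat([z, a], dim=-1), hidden)
        params = self.mdn(out).view(out.size(0), out.size(1),
                                    self.n_z, 3 * self.n_gaussians)
        logit_pi, mu, log_sigma = params.chunk(3, dim=-1)
        return F.log_softmax(logit_pi, dim=-1), mu, log_sigma, hidden


def mdn_loss(log_pi, mu, log_sigma, z_next):
    """Negative log-likelihood of the next latent code under the mixture."""
    z = z_next.unsqueeze(-1)  # broadcast over the mixture components
    log_prob = (-0.5 * ((z - mu) / log_sigma.exp()) ** 2
                - log_sigma - 0.5 * math.log(2 * math.pi))
    return -torch.logsumexp(log_pi + log_prob, dim=-1).mean()
```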

B.3 Controller

The original controller presented by Ha and Schmidhuber [5] consists of a single fully-connected layer with a custom, environment-specific activation function defined in Table 9. In this work, we also analyzed different architectural choices, such as removing the squashing activation function and adding a hidden layer with 24 units and a Tanh activation function. The controller is trained with CMA-ES [6] by interacting with the environment.

Table 9. Custom activation function from the implementation of Ha and Schmidhuber [5]
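As an illustration, the following sketch shows a linear controller of this form and a CMA-ES training loop using the `cma` package [6]. The input and output dimensions (32-dim latent plus 256-dim hidden state, 3 CarRacing actions) and the particular squashing in the comments follow the publicly available World Models implementation, not necessarily Table 9; the `evaluate` function (average return over rollouts) is assumed to be provided.

```python
import numpy as np
import cma  # pycma implementation of CMA-ES [6]

N_Z, N_HIDDEN, N_ACTIONS = 32, 256, 3  # latent code, LSTM hidden state, CarRacing actions
N_PARAMS = N_ACTIONS * (N_Z + N_HIDDEN) + N_ACTIONS  # weights + biases


def controller(params, z, h):
    """Single fully-connected layer mapping [z, h] to the 3 actions."""
    W = params[:N_ACTIONS * (N_Z + N_HIDDEN)].reshape(N_ACTIONS, N_Z + N_HIDDEN)
    b = params[-N_ACTIONS:]
    a = W @ np.concatenate([z, h]) + b
    # Environment-specific squashing in the spirit of Table 9:
    # steering in [-1, 1], gas and brake in [0, 1].
    return np.array([np.tanh(a[0]),
                     (np.tanh(a[1]) + 1.0) / 2.0,
                     np.clip(np.tanh(a[2]), 0.0, 1.0)])


def evolve(evaluate, sigma0=0.1, n_generations=100):
    """evaluate(params) -> average episode return; CMA-ES maximizes it."""
    es = cma.CMAEvolutionStrategy(N_PARAMS * [0.0], sigma0)
    for _ in range(n_generations):
        candidates = es.ask()
        es.tell(candidates, [-evaluate(p) for p in candidates])  # cma minimizes
    return es.result.xbest
```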

B.4 Hyperparameters

(See Table 10)

Table 10. Hyperparameters used for training the World Models agent


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Esteves, B., Melo, F.S. (2021). Revisiting “Recurrent World Models Facilitate Policy Evolution”. In: Marreiros, G., Melo, F.S., Lau, N., Lopes Cardoso, H., Reis, L.P. (eds) Progress in Artificial Intelligence. EPIA 2021. Lecture Notes in Computer Science, vol 12981. Springer, Cham. https://doi.org/10.1007/978-3-030-86230-5_26


  • DOI: https://doi.org/10.1007/978-3-030-86230-5_26

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-86229-9

  • Online ISBN: 978-3-030-86230-5
