
An exploratory rollout policy for imagination-augmented agents

Published in Applied Intelligence

Abstract

Typical reinforcement learning methods lack explicit planning and therefore require large amounts of training data to reach the expected performance. Imagination-Augmented Agents (I2A) take a model-based approach: they learn to extract information from imagined trajectories to construct implicit plans, which improves data efficiency and performance. In I2A, however, the imagined trajectories are generated by a single shared rollout policy, so they tend to look alike and carry little information. We propose an exploratory rollout policy, E-I2A. When the agent performs poorly, E-I2A produces diverse imagined trajectories that are more informative; as the agent improves with training, the trajectories it generates become consistent with the agent's trajectories in the real environment and yield high rewards. To achieve this, we first quantify the novelty of a state by training an inverse dynamics model, and the agent expands the states with the highest novelty to generate diverse trajectories. In parallel, we train a distilled value-function model to estimate the expected return of a state, which lets the agent imagine the states with the highest return so that the imagined trajectories stay consistent with the real ones. Finally, we propose an adaptive scheme that gradually shifts the imagined trajectories from diverse to consistent as the agent's performance improves. By providing more information at decision time, our method improves both performance and data efficiency. We evaluate E-I2A on several challenging domains, including MiniPacman and Sokoban, where it outperforms several baselines.
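The abstract describes three ingredients: a novelty signal derived from an inverse dynamics model, a distilled value function that estimates expected return, and an adaptive trade-off that moves the rollout policy from exploratory (diverse) to value-driven (consistent) imagination as training progresses. The sketch below is one possible reading of that recipe in PyTorch; everything in it (the names InverseDynamicsModel, exploratory_rollout_step, env_model, value_fn, the cross-entropy novelty proxy, and the linear mixing weight beta) is an assumption made for illustration, not the paper's released code.

```python
# A minimal, illustrative sketch of an exploratory rollout step in the spirit
# of E-I2A. It is NOT the authors' implementation: the module names, the
# cross-entropy novelty proxy, the env_model/value_fn interfaces, and the
# linear mixing weight beta are assumptions made for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class InverseDynamicsModel(nn.Module):
    """Predicts which action led from state s_t to s_{t+1}; its prediction
    error is used below as a proxy for the novelty of the transition."""

    def __init__(self, state_dim, num_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, s, s_next):
        return self.net(torch.cat([s, s_next], dim=-1))

    def novelty(self, s, s_next, action):
        # High action-prediction error => the transition is unfamiliar/novel.
        logits = self.forward(s, s_next)
        return F.cross_entropy(logits, action, reduction="none")


def exploratory_rollout_step(env_model, inverse_model, value_fn,
                             state, num_actions, beta):
    """Choose the next imagined action for a single rollout step.

    beta in [0, 1] weights novelty against the distilled value estimate;
    annealing beta toward 0 as the agent's real-environment return improves
    shifts the imagined trajectories from diverse to consistent.
    """
    scores, next_states = [], []
    for a in range(num_actions):
        action = torch.tensor([a])
        next_state, _reward = env_model(state, action)           # learned environment model
        nov = inverse_model.novelty(state, next_state, action)   # exploration term
        val = value_fn(next_state).squeeze(-1)                   # distilled value estimate
        scores.append(beta * nov + (1.0 - beta) * val)
        next_states.append(next_state)
    best = int(torch.stack(scores).argmax())
    return best, next_states[best]
```

In this reading, a large beta early in training steers imagination toward unfamiliar transitions, while a small beta later keeps imagined rollouts close to the high-return behaviour the agent actually follows in the real environment.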



Notes

  1. https://github.com/sracaniere

  2. https://github.com/mpSchrader/gym-sokoban

  3. https://github.com/ikostrikov/pytorch-a3c


Acknowledgements

This work was supported by the National Natural Science Foundation of China (61671175, 61672190) and the Lab of Space Optoelectronic Measurement & Perception (No. LabSOMP-2018-01).

Author information

Corresponding author

Correspondence to Wei Zhao.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article

Cite this article

Liu, P., Zhao, Y., Zhao, W. et al. An exploratory rollout policy for imagination-augmented agents. Appl Intell 49, 3749–3764 (2019). https://doi.org/10.1007/s10489-019-01484-7

