
An exploratory rollout policy for imagination-augmented agents

Published in Applied Intelligence

Abstract

Typical reinforcement learning methods lack explicit planning and therefore require large amounts of training data to reach the expected performance. Imagination-Augmented Agents (I2A) take a model-based approach: they learn to extract information from imagined trajectories to construct implicit plans, which improves data efficiency and performance. In I2A, however, the imagined trajectories are generated by a single shared rollout policy, so they tend to look alike and carry little information. We propose an exploratory rollout policy, E-I2A. When the agent performs poorly, E-I2A produces diverse imagined trajectories that are more informative; as the agent improves with training, the trajectories it generates become consistent with the agent's trajectories in the real environment and yield high rewards. To achieve this, we first quantify the novelty of a state by training an inverse dynamics model, and the agent expands the states with the highest novelty to generate diverse trajectories. In parallel, we train a distilled value-function model to estimate the expected return of a state, which lets the agent imagine the states with the highest return so that the imagined trajectories stay consistent with the real ones. Finally, we propose an adaptive scheme that gradually shifts the imagined trajectories from diverse to consistent as the agent's performance improves. By providing more information at decision time, our method improves both performance and data efficiency. We evaluate E-I2A on several challenging domains, including MiniPacman and Sokoban, where it outperforms several baselines.
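The abstract describes three ingredients: a novelty signal derived from an inverse dynamics model, a distilled value function that estimates expected return, and an adaptive trade-off that moves the rollout policy from exploratory (diverse) to value-driven (consistent) imagination as training progresses. The sketch below is one possible reading of that recipe in PyTorch; everything in it (the names InverseDynamicsModel, exploratory_rollout_step, env_model, value_fn, the cross-entropy novelty proxy, and the linear mixing weight beta) is an assumption made for illustration, not the paper's released code.

```python
# A minimal, illustrative sketch of an exploratory rollout step in the spirit
# of E-I2A. It is NOT the authors' implementation: the module names, the
# cross-entropy novelty proxy, the env_model/value_fn interfaces, and the
# linear mixing weight beta are assumptions made for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class InverseDynamicsModel(nn.Module):
    """Predicts which action led from state s_t to s_{t+1}; its prediction
    error is used below as a proxy for the novelty of the transition."""

    def __init__(self, state_dim, num_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, s, s_next):
        return self.net(torch.cat([s, s_next], dim=-1))

    def novelty(self, s, s_next, action):
        # High action-prediction error => the transition is unfamiliar/novel.
        logits = self.forward(s, s_next)
        return F.cross_entropy(logits, action, reduction="none")


def exploratory_rollout_step(env_model, inverse_model, value_fn,
                             state, num_actions, beta):
    """Choose the next imagined action for a single rollout step.

    beta in [0, 1] weights novelty against the distilled value estimate;
    annealing beta toward 0 as the agent's real-environment return improves
    shifts the imagined trajectories from diverse to consistent.
    """
    scores, next_states = [], []
    for a in range(num_actions):
        action = torch.tensor([a])
        next_state, _reward = env_model(state, action)           # learned environment model
        nov = inverse_model.novelty(state, next_state, action)   # exploration term
        val = value_fn(next_state).squeeze(-1)                   # distilled value estimate
        scores.append(beta * nov + (1.0 - beta) * val)
        next_states.append(next_state)
    best = int(torch.stack(scores).argmax())
    return best, next_states[best]
```

In this reading, a large beta early in training steers imagination toward unfamiliar transitions, while a small beta later keeps imagined rollouts close to the high-return behaviour the agent actually follows in the real environment.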



Notes

  1. https://github.com/sracaniere

  2. https://github.com/mpSchrader/gym-sokoban

  3. https://github.com/ikostrikov/pytorch-a3c


Acknowledgements

This work was supported by the National Natural Science Foundation of China (61671175, 61672190) and the Lab of Space Optoelectronic Measurement & Perception (No. LabSOMP-2018-01).

Author information

Corresponding author

Correspondence to Wei Zhao.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article

Cite this article

Liu, P., Zhao, Y., Zhao, W. et al. An exploratory rollout policy for imagination-augmented agents. Appl Intell 49, 3749–3764 (2019). https://doi.org/10.1007/s10489-019-01484-7

