ABSTRACT
Policy distillation (PD) has been widely studied in deep reinforcement learning (RL), yet existing PD approaches assume that the demonstration data (i.e., the state-action pairs in frames) in a decision-making sequence are uniformly distributed. This assumption can introduce unwanted bias, since RL is a reward-maximization process rather than simple label matching. Motivated by this issue, we define the importance of a frame as its contribution to the expected reward, and hypothesize that accounting for frame importance can improve the performance of the distilled student policy. To verify this hypothesis, we analyze why and how frame importance matters in RL settings. Based on the analysis, we propose an importance-prioritized PD framework that emphasizes training on important frames so as to learn efficiently. In particular, frame importance is measured by the reciprocal of the weighted Shannon entropy of the teacher policy's action prescriptions. Experiments on Atari games and policy-compression tasks show that capturing frame importance significantly boosts the performance of the distilled policies.
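To make the entropy-based weighting concrete, below is a minimal PyTorch sketch of the idea described above: per-frame importance is the reciprocal of the Shannon entropy of the teacher's action distribution, and that importance reweights a standard KL distillation loss. This is illustrative only: the function names (`frame_importance`, `prioritized_distillation_loss`), the softmax over teacher logits, and the use of the plain (unweighted) entropy are assumptions, since the abstract refers to a *weighted* Shannon entropy whose exact weighting scheme is not specified here.

```python
import torch
import torch.nn.functional as F

def frame_importance(teacher_logits: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Per-frame importance as the reciprocal of the Shannon entropy of the
    teacher's action distribution (a stand-in for the paper's weighted entropy).
    A decisive teacher (low entropy) marks a frame as important; a near-uniform
    teacher (high entropy) marks it as unimportant."""
    probs = F.softmax(teacher_logits, dim=-1)              # (batch, n_actions)
    entropy = -(probs * torch.log(probs + eps)).sum(-1)    # (batch,)
    return 1.0 / (entropy + eps)

def prioritized_distillation_loss(student_logits: torch.Tensor,
                                  teacher_logits: torch.Tensor,
                                  temperature: float = 1.0) -> torch.Tensor:
    """Importance-weighted KL distillation: frames where the teacher is more
    decisive contribute more to the student's gradient update."""
    w = frame_importance(teacher_logits).detach()
    w = w / w.sum()                                        # normalize over the batch
    t_probs = F.softmax(teacher_logits / temperature, dim=-1)
    s_logp = F.log_softmax(student_logits / temperature, dim=-1)
    kl = (t_probs * (torch.log(t_probs + 1e-6) - s_logp)).sum(-1)  # per-frame KL
    return (w * kl).sum()
```

In this sketch the weights are detached from the graph so that prioritization only rescales per-frame losses rather than creating a gradient path through the teacher's entropy.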