DOI: 10.1145/3534678.3539266

Importance Prioritized Policy Distillation

Published: 14 August 2022

Abstract

Policy distillation (PD) has been widely studied in deep reinforcement learning (RL), but existing PD approaches assume that the demonstration data (i.e., state-action pairs in frames) in a decision-making sequence is uniformly distributed. This can introduce unwanted bias, since RL is a reward-maximizing process rather than simple label matching. Given this issue, we define the importance of a frame as its contribution to the expected reward, and hypothesize that accounting for frame importance can benefit the performance of the distilled student policy. To verify this hypothesis, we analyze why and how frame importance matters in RL settings. Based on the analysis, we propose an importance prioritized PD framework that emphasizes training on important frames so as to learn efficiently. Specifically, frame importance is measured by the reciprocal of the weighted Shannon entropy of the teacher policy's action prescriptions. Experiments on Atari games and policy compression tasks show that capturing frame importance significantly boosts the performance of the distilled policies.
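To make the stated metric concrete, below is a minimal sketch (not the authors' released code) of how a frame-importance score and an importance-weighted distillation loss along these lines might be computed. The function names, the uniform default action weights, the batch normalization of importances, and the cross-entropy loss form are illustrative assumptions; only the definition of importance as the reciprocal of the weighted Shannon entropy of the teacher's action distribution comes from the abstract.

```python
import numpy as np

def frame_importance(teacher_action_probs, action_weights=None, eps=1e-8):
    """Importance of a single frame: the reciprocal of the weighted Shannon
    entropy of the teacher policy's action distribution for that frame.
    A near-deterministic prescription (low entropy) gives a high score;
    a near-uniform prescription (high entropy) gives a low score."""
    p = np.asarray(teacher_action_probs, dtype=np.float64)
    p = p / p.sum()  # ensure a proper probability distribution
    w = (np.ones_like(p) if action_weights is None
         else np.asarray(action_weights, dtype=np.float64))
    weighted_entropy = -np.sum(w * p * np.log(p + eps))
    return 1.0 / (weighted_entropy + eps)

def prioritized_distillation_loss(student_log_probs, teacher_probs, importances):
    """Importance-weighted distillation loss over a batch of frames:
    per-frame cross-entropy between teacher and student action
    distributions, scaled by normalized frame-importance weights."""
    importances = np.asarray(importances, dtype=np.float64)
    weights = importances / importances.sum()
    per_frame_ce = -np.sum(teacher_probs * student_log_probs, axis=1)
    return float(np.sum(weights * per_frame_ce))
```

In an actual training loop the student's log-probabilities would come from a differentiable network and the loss would be minimized with a framework such as PyTorch; NumPy is used here only to keep the arithmetic explicit.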

Supplemental Material

MP4 File: Presentation video



    Published In

    KDD '22: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
    August 2022
    5033 pages
ISBN: 9781450393850
DOI: 10.1145/3534678
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

Association for Computing Machinery, New York, NY, United States



    Author Tags

    1. importance prioritization
    2. policy distillation
    3. reinforcement learning

    Qualifiers

    • Research-article

    Funding Sources

    • A*STAR AI3 HTPO seed grant C211118016 on Upside-Down Multi-Objective Bayesian Optimization for Few-Shot Design
    • NTU Data Science and Artificial Intelligence Center
    • A*STAR Centre for Frontier AI Research

    Conference

    KDD '22

    Acceptance Rates

    Overall Acceptance Rate 1,133 of 8,635 submissions, 13%


    Article Metrics

    • Downloads (Last 12 months)68
    • Downloads (Last 6 weeks)8
    Reflects downloads up to 28 Feb 2025

    Cited By

• (2024) Learning to Control Inverted Pendulum on Surface with Free Angles Through Policy Distillation. 2024 China Automation Congress (CAC), 2750-2755. DOI: 10.1109/CAC63892.2024.10865313. Online publication date: 1-Nov-2024.
• (2024) A Novel Policy Distillation With WPA-Based Knowledge Filtering Algorithm for Efficient Industrial Robot Control. IEEE Access, 12, 154514-154525. DOI: 10.1109/ACCESS.2024.3483970. Online publication date: 2024.
• (2024) Joint Semantic Segmentation using representations of LiDAR point clouds and camera images. Information Fusion, 108:C. DOI: 10.1016/j.inffus.2024.102370. Online publication date: 1-Aug-2024.
• (2023) FastAct: A Lightweight Actor Compression Framework for Fast Policy Learning. 2023 International Joint Conference on Neural Networks (IJCNN), 1-8. DOI: 10.1109/IJCNN54540.2023.10191108. Online publication date: 18-Jun-2023.
• (2023) Omnidirectional Information Gathering for Knowledge Transfer-based Audio-Visual Navigation. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 10959-10969. DOI: 10.1109/ICCV51070.2023.01009. Online publication date: 1-Oct-2023.
