ABSTRACT
Policy distillation (PD) has been widely studied in deep reinforcement learning (RL), yet existing PD approaches assume that the demonstration data (i.e., the state-action pairs in frames) in a decision-making sequence are uniformly distributed. This assumption can introduce unwanted bias, since RL is a reward-maximization process rather than simple label matching. Motivated by this issue, we define the importance of a frame as its contribution to the expected reward, and hypothesize that accounting for frame importance can improve the performance of the distilled student policy. To verify this hypothesis, we analyze why and how frame importance matters in RL settings. Based on the analysis, we propose an importance-prioritized PD framework that emphasizes training on important frames so as to learn efficiently. In particular, frame importance is measured by the reciprocal of the weighted Shannon entropy of the teacher policy's action prescriptions. Experiments on Atari games and policy-compression tasks show that capturing frame importance significantly boosts the performance of the distilled policies.
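To make the entropy-based weighting concrete, below is a minimal PyTorch sketch of the idea described above: per-frame importance is the reciprocal of the Shannon entropy of the teacher's action distribution, and that importance reweights a standard KL distillation loss. This is illustrative only: the function names (`frame_importance`, `prioritized_distillation_loss`), the softmax over teacher logits, and the use of the plain (unweighted) entropy are assumptions, since the abstract refers to a *weighted* Shannon entropy whose exact weighting scheme is not specified here.

```python
import torch
import torch.nn.functional as F

def frame_importance(teacher_logits: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Per-frame importance as the reciprocal of the Shannon entropy of the
    teacher's action distribution (a stand-in for the paper's weighted entropy).
    A decisive teacher (low entropy) marks a frame as important; a near-uniform
    teacher (high entropy) marks it as unimportant."""
    probs = F.softmax(teacher_logits, dim=-1)              # (batch, n_actions)
    entropy = -(probs * torch.log(probs + eps)).sum(-1)    # (batch,)
    return 1.0 / (entropy + eps)

def prioritized_distillation_loss(student_logits: torch.Tensor,
                                  teacher_logits: torch.Tensor,
                                  temperature: float = 1.0) -> torch.Tensor:
    """Importance-weighted KL distillation: frames where the teacher is more
    decisive contribute more to the student's gradient update."""
    w = frame_importance(teacher_logits).detach()
    w = w / w.sum()                                        # normalize over the batch
    t_probs = F.softmax(teacher_logits / temperature, dim=-1)
    s_logp = F.log_softmax(student_logits / temperature, dim=-1)
    kl = (t_probs * (torch.log(t_probs + 1e-6) - s_logp)).sum(-1)  # per-frame KL
    return (w * kl).sum()
```

In this sketch the weights are detached from the graph so that prioritization only rescales per-frame losses rather than creating a gradient path through the teacher's entropy.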