Abstract
Although deep neural networks have achieved remarkable performance, their heavy computational requirements limit deployment on end devices. To this end, a variety of model compression and acceleration techniques have been developed. Among these, knowledge distillation has emerged as a popular approach that trains a small student model to mimic the performance of a larger teacher model. However, this practice implicitly assumes that the student architecture is already compact, whereas the student architectures used in existing knowledge distillation are often suboptimal and contain redundancy, which raises questions about the validity of this assumption in practice. This study investigates this assumption and empirically demonstrates that student models can contain redundancy that can be removed through pruning without significant performance degradation. We therefore propose a novel pruning method to eliminate redundancy in student models. Instead of applying traditional post-training pruning, we perform pruning during knowledge distillation (PDD) to prevent any loss of important information transferred from the teacher model to the student model. This is achieved by designing a differentiable mask for each convolutional layer, which dynamically adjusts the channels to be pruned based on the loss. Experimental results show that with ResNet20 as the student model and ResNet56 as the teacher model, PDD achieves a 39.53% FLOPs reduction by removing 32.77% of the parameters, while the top-1 accuracy on CIFAR10 increases by 0.17%. With VGG11 as the student model and VGG16 as the teacher model, PDD achieves a 74.96% FLOPs reduction by removing 76.43% of the parameters, with only a 1.34% drop in top-1 accuracy on CIFAR10. Our code is available at https://github.com/YihangZhou0424/PDD-Pruning-during-distillation.
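As context for the method summarized above, the sketch below illustrates one way a differentiable per-channel mask on a convolutional layer could be trained jointly with a standard distillation loss. This is a minimal, hypothetical PyTorch example, not the authors' released implementation (see the repository linked above); the module name MaskedConv2d, the sigmoid gating, the temperature T=4.0, the weight alpha=0.9, and the sparsity coefficient 1e-2 are all illustrative assumptions.

```python
# Minimal sketch (not the authors' exact implementation): a differentiable
# per-channel mask attached to a convolutional layer, trained jointly with a
# standard knowledge-distillation loss so that channels are softly pruned
# during distillation rather than after training.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedConv2d(nn.Module):
    """Conv layer whose output channels are gated by a learnable, differentiable mask."""

    def __init__(self, in_channels, out_channels, kernel_size, **kwargs):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size, **kwargs)
        # One learnable logit per output channel; sigmoid keeps each gate in (0, 1).
        self.mask_logits = nn.Parameter(torch.zeros(out_channels))

    def forward(self, x):
        gate = torch.sigmoid(self.mask_logits).view(1, -1, 1, 1)
        return self.conv(x) * gate

    def sparsity_loss(self):
        # Encourages gates toward zero so redundant channels can be removed later.
        return torch.sigmoid(self.mask_logits).mean()


def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Hinton-style KD loss: soft-target KL term plus hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard


if __name__ == "__main__":
    # Toy example: one masked conv block distilled from a frozen "teacher".
    student = nn.Sequential(
        MaskedConv2d(3, 16, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10),
    )
    teacher = nn.Sequential(
        nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10),
    ).eval()

    opt = torch.optim.SGD(student.parameters(), lr=0.1)
    x = torch.randn(8, 3, 32, 32)
    y = torch.randint(0, 10, (8,))

    with torch.no_grad():
        t_logits = teacher(x)
    s_logits = student(x)

    sparsity = sum(m.sparsity_loss() for m in student.modules()
                   if isinstance(m, MaskedConv2d))
    loss = distillation_loss(s_logits, t_logits, y) + 1e-2 * sparsity
    loss.backward()
    opt.step()
    # After distillation, channels whose gates fall below a threshold can be pruned.
```

In a scheme of this kind, channels whose gates converge toward zero can be removed once distillation finishes, which is how FLOPs and parameter reductions such as those reported above would be realized.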





Data Availability
No datasets were generated or analyzed specifically for this study. All datasets used in this study are publicly available and can be accessed and downloaded by anyone from open-source repositories.
Acknowledgements
We gratefully acknowledge the financial support provided by the Beijing Natural Science Foundation under Grant 4244098. This support has been invaluable to the completion of this work, enabling us to carry out the research with adequate resources. We extend our sincere thanks to the Foundation for its support and confidence in our project.
Author information
Contributions
Xi Dan led the conception and design of the study, played a pivotal role in developing the methodology, and conducted the primary experiments, making the greatest contribution to the work. Wenjie Yang and Boyuan Zhao contributed to the experimental design, data analysis, and interpretation, and assisted significantly in drafting and revising the manuscript. Fuyan Zhang was involved in data collection, preprocessing, and conducting supplementary experiments to validate the findings. Yihang Zhou and Zhen Qiu contributed to the development of the software used for data analysis, including coding and debugging, and participated in preparing the experimental setup. Zhuojun Yu provided substantial support in the literature review and data interpretation and was involved in critical discussions that shaped the research conclusions. Zeyu Dong and Zhen Qiu assisted in the technical aspects of the project, including software optimization and troubleshooting, and contributed to the manuscript by providing insights into the methodology. Libo Huang and Chuanguang Yang, as the corresponding authors, were instrumental in overseeing the project, refining the research questions, and significantly enhancing the manuscript through critical revisions for important intellectual content; their contributions were crucial in aligning the study with high standards of scientific rigor and ensuring the integrity and accuracy of the work.
Funding Declaration: Beijing Natural Science Foundation, Grant 4244098.
Ethics declarations
Competing interests
The authors declare no competing interests.
Conflict of Interest
The authors declare that there is no conflict of interest regarding the publication of this paper. No financial or personal relationships with other people or organizations have influenced the work reported in this manuscript, “PDD: Pruning Neural Networks During Knowledge Distillation.” This research was conducted purely from an academic perspective, and all affiliations and financial support are disclosed in the manuscript. Our study has been carried out with a commitment to transparency and honesty in all research findings and discussions related to this work.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Dan, X., Yang, W., Zhang, F. et al. PDD: Pruning Neural Networks During Knowledge Distillation. Cogn Comput 16, 3457–3467 (2024). https://doi.org/10.1007/s12559-024-10350-9