Abstract
Although deep neural networks have achieved remarkable performance, their heavy computational requirements limit deployment on end devices. To this end, a variety of model compression and acceleration techniques have been developed. Among these, knowledge distillation has emerged as a popular approach that trains a small student model to mimic the performance of a larger teacher model. However, this practice implicitly assumes that the student architecture is already compact, whereas the student architectures used in existing knowledge distillation are often suboptimal and contain redundancy, which raises questions about the validity of this assumption in practice. This study investigates this assumption and empirically demonstrates that student models can contain redundancy that can be removed through pruning without significant performance degradation. We therefore propose a novel pruning method to eliminate redundancy in student models. Instead of applying traditional post-training pruning, we perform pruning during knowledge distillation (PDD) to prevent any loss of important information transferred from the teacher model to the student model. This is achieved by designing a differentiable mask for each convolutional layer, which dynamically adjusts the channels to be pruned based on the loss. Experimental results show that with ResNet20 as the student model and ResNet56 as the teacher model, PDD achieves a 39.53% FLOPs reduction by removing 32.77% of the parameters, while the top-1 accuracy on CIFAR10 increases by 0.17%. With VGG11 as the student model and VGG16 as the teacher model, PDD achieves a 74.96% FLOPs reduction by removing 76.43% of the parameters, with only a 1.34% drop in top-1 accuracy on CIFAR10. Our code is available at https://github.com/YihangZhou0424/PDD-Pruning-during-distillation.
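As context for the method summarized above, the sketch below illustrates one way a differentiable per-channel mask on a convolutional layer could be trained jointly with a standard distillation loss. This is a minimal, hypothetical PyTorch example, not the authors' released implementation (see the repository linked above); the module name MaskedConv2d, the sigmoid gating, the temperature T=4.0, the weight alpha=0.9, and the sparsity coefficient 1e-2 are all illustrative assumptions.

```python
# Minimal sketch (not the authors' exact implementation): a differentiable
# per-channel mask attached to a convolutional layer, trained jointly with a
# standard knowledge-distillation loss so that channels are softly pruned
# during distillation rather than after training.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedConv2d(nn.Module):
    """Conv layer whose output channels are gated by a learnable, differentiable mask."""

    def __init__(self, in_channels, out_channels, kernel_size, **kwargs):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size, **kwargs)
        # One learnable logit per output channel; sigmoid keeps each gate in (0, 1).
        self.mask_logits = nn.Parameter(torch.zeros(out_channels))

    def forward(self, x):
        gate = torch.sigmoid(self.mask_logits).view(1, -1, 1, 1)
        return self.conv(x) * gate

    def sparsity_loss(self):
        # Encourages gates toward zero so redundant channels can be removed later.
        return torch.sigmoid(self.mask_logits).mean()


def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Hinton-style KD loss: soft-target KL term plus hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard


if __name__ == "__main__":
    # Toy example: one masked conv block distilled from a frozen "teacher".
    student = nn.Sequential(
        MaskedConv2d(3, 16, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10),
    )
    teacher = nn.Sequential(
        nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10),
    ).eval()

    opt = torch.optim.SGD(student.parameters(), lr=0.1)
    x = torch.randn(8, 3, 32, 32)
    y = torch.randint(0, 10, (8,))

    with torch.no_grad():
        t_logits = teacher(x)
    s_logits = student(x)

    sparsity = sum(m.sparsity_loss() for m in student.modules()
                   if isinstance(m, MaskedConv2d))
    loss = distillation_loss(s_logits, t_logits, y) + 1e-2 * sparsity
    loss.backward()
    opt.step()
    # After distillation, channels whose gates fall below a threshold can be pruned.
```

In a scheme of this kind, channels whose gates converge toward zero can be removed once distillation finishes, which is how FLOPs and parameter reductions such as those reported above would be realized.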





Data Availability
No datasets were generated or analyzed specifically for this study. All datasets used in this study are publicly available and can be accessed and downloaded by anyone from open-source repositories.
Acknowledgements
We gratefully acknowledge the financial support provided by the Beijing Natural Science Foundation under Grant 4244098. This support has been invaluable to the completion of this work, enabling us to carry out the research with adequate resources. We extend our sincere thanks to the Foundation for its support and confidence in our project.
Author information
Contributions
Xi Dan led the conception and design of the study, played a pivotal role in developing the methodology, and conducted the primary experiments, making the greatest contribution to the work. Wenjie Yang and Boyuan Zhao contributed to the experimental design, data analysis, and interpretation, and assisted significantly in drafting and revising the manuscript. Fuyan Zhang was involved in data collection, preprocessing, and conducting supplementary experiments to validate the findings. Yihang Zhou and Zhen Qiu contributed to the development of the software used for data analysis, including coding and debugging, and participated in preparing the experimental setup. Zhuojun Yu provided substantial support in the literature review and data interpretation and was involved in critical discussions that shaped the research conclusions. Zeyu Dong and Zhen Qiu assisted in the technical aspects of the project, including software optimization and troubleshooting, and contributed to the manuscript by providing insights into the methodology. Libo Huang and Chuanguang Yang, as the corresponding authors, were instrumental in overseeing the project, refining the research questions, and significantly enhancing the manuscript through critical revisions for important intellectual content; their contributions were crucial in aligning the study with high standards of scientific rigor and ensuring the integrity and accuracy of the work.
Funding Declaration: Beijing Natural Science Foundation, Grant 4244098.
Ethics declarations
Competing interests
The authors declare no competing interests.
Conflict of Interest
The authors declare that there is no conflict of interest regarding the publication of this paper. No financial or personal relationships with other people or organizations have influenced the work reported in this manuscript, “PDD: Pruning Neural Networks During Knowledge Distillation.” This research was conducted purely from an academic perspective, and all affiliations and financial support are disclosed in the manuscript. Our study has been carried out with a commitment to transparency and honesty in all research findings and discussions related to this work.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Dan, X., Yang, W., Zhang, F. et al. PDD: Pruning Neural Networks During Knowledge Distillation. Cogn Comput 16, 3457–3467 (2024). https://doi.org/10.1007/s12559-024-10350-9