Abstract
Knowledge distillation aims to transfer the knowledge of a large, well-trained teacher model to a lightweight student model. In recent years, multi-teacher knowledge distillation has received widespread attention because it draws on diverse knowledge sources from multiple teachers to give the student more comprehensive guidance. However, existing multi-teacher distillation methods usually rely on a single aggregation strategy, ignoring the disparities among different types of knowledge. In addition, they typically fix the temperature to a constant value, overlooking its effect on multi-teacher knowledge distillation. To address these issues, we propose adaptive temperature guided multi-teacher knowledge distillation (ATMKD), which combines an adaptive temperature with a diverse aggregation strategy to improve distillation performance. Specifically, at the internal level we employ a dynamic, learnable temperature to adaptively control the difficulty of multi-teacher knowledge; at the external level, a diverse aggregation strategy fuses the rich knowledge from multiple teachers. Optimizing the teacher outputs at both levels provides more comprehensive guidance for the student model and yields better distillation performance. Extensive experiments with various teacher and student architectures on multiple benchmark datasets show that the proposed approach outperforms other knowledge distillation methods. The code is available at https://github.com/JSJ515-Group/ATMKD.
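For orientation, the sketch below shows one way a learnable temperature and a per-sample weighted aggregation of teacher logits can be combined in a multi-teacher distillation loss. It is an illustration under assumed design choices (confidence-based teacher weighting and a KL distillation term), not the authors' ATMKD implementation; please refer to the linked repository for the actual code.

```python
# Minimal, illustrative multi-teacher KD loss with a learnable temperature.
# The weighting scheme and module names are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTeacherKDLoss(nn.Module):
    def __init__(self, init_temperature: float = 4.0):
        super().__init__()
        # Learnable (log-)temperature, optimized jointly with the student.
        self.log_t = nn.Parameter(torch.log(torch.tensor(init_temperature)))

    def forward(self, student_logits, teacher_logits_list, labels):
        t = self.log_t.exp()  # temperature stays positive
        # Weight each teacher per sample by its confidence on the true class
        # (one simple aggregation choice, not the paper's exact strategy).
        weights = torch.stack(
            [F.softmax(tl, dim=1).gather(1, labels.unsqueeze(1)).squeeze(1)
             for tl in teacher_logits_list], dim=0)           # (K, B)
        weights = weights / weights.sum(dim=0, keepdim=True)   # normalize over teachers

        log_p_s = F.log_softmax(student_logits / t, dim=1)
        kd = 0.0
        for w, tl in zip(weights, teacher_logits_list):
            p_t = F.softmax(tl / t, dim=1)
            kl = F.kl_div(log_p_s, p_t, reduction="none").sum(dim=1)  # per-sample KL
            kd = kd + (w * kl).mean()

        ce = F.cross_entropy(student_logits, labels)
        return ce + (t ** 2) * kd  # standard T^2 scaling of the distillation term
```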












Data availability
The data are available from the corresponding author on reasonable request. Three datasets are used in our experiments: CIFAR100, Tiny-ImageNet, and Stanford Dogs. The CIFAR100 dataset is available at https://www.cs.toronto.edu/~kriz/cifar.html, the Tiny-ImageNet dataset at http://cs231n.stanford.edu/tiny-imagenet-200.zip, and the Stanford Dogs dataset at http://vision.stanford.edu/aditya86/ImageNetDogs/.
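As a convenience, the snippet below shows one common way to obtain the CIFAR100 training and test splits via torchvision; the normalization statistics are widely used community values, and the paper's own data pipeline may differ.

```python
# Illustrative only: fetching CIFAR100 with torchvision.
import torchvision
import torchvision.transforms as T

transform = T.Compose([
    T.ToTensor(),
    T.Normalize((0.5071, 0.4865, 0.4409),   # commonly used CIFAR100 channel means
                (0.2673, 0.2564, 0.2762)),  # and standard deviations
])
train_set = torchvision.datasets.CIFAR100(root="./data", train=True,
                                          download=True, transform=transform)
test_set = torchvision.datasets.CIFAR100(root="./data", train=False,
                                         download=True, transform=transform)
```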
Acknowledgements
This research was supported in part by the Research Foundation of the Institute of Environment-friendly Materials and Occupational Health (Wuhu), Anhui University of Science and Technology (No. ALW2021YF04), in part by the Medical Special Cultivation Project of Anhui University of Science and Technology (No. YZ2023H2C005), and in part by the 2023 Graduate Innovation Fund Project of Anhui University of Science and Technology (No. 2023CX2136).
Author information
Contributions
Yu-e Lin and Xingzhu Liang agree to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved; Shuting Yin made substantial contributions to the conception and drafted the work; Yifeng Ding performed the experiments and analyzed the data. Shuting Yin and Yifeng Ding wrote the original manuscript. All the authors reviewed the manuscript.
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Ethical approval
This study was conducted with the highest regard for ethical standards and following relevant guidelines and regulations. The research protocol did not require ethical review or approval as it did not involve human participants, animals, or sensitive data. All data used in this study were obtained from publicly available sources and were properly cited and acknowledged. No private or personally identifiable information was used or accessed during this research. The authors declare that there are no conflicts of interest, financial or otherwise, that could potentially influence the objectivity or integrity of this study. While no ethical review or approval was necessary for this particular study, the principles of academic integrity and research ethics were strictly adhered to throughout the research process.
Additional information
Communicated by Bing-kun Bao.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Lin, Ye., Yin, S., Ding, Y. et al. ATMKD: adaptive temperature guided multi-teacher knowledge distillation. Multimedia Systems 30, 292 (2024). https://doi.org/10.1007/s00530-024-01483-w