Abstract
Knowledge distillation aims to transfer the knowledge of a large, well-trained teacher model to a lightweight student model. In recent years, multi-teacher knowledge distillation has received widespread attention because it draws on diverse knowledge sources from multiple teachers to give the student more comprehensive guidance. However, existing multi-teacher distillation methods usually rely on a single aggregation strategy, ignoring the disparities among different types of knowledge. In addition, they typically fix the temperature to a constant value, overlooking its effect on multi-teacher knowledge distillation. To address these issues, we propose adaptive temperature guided multi-teacher knowledge distillation (ATMKD), which combines an adaptive temperature with a diverse aggregation strategy to improve distillation performance. Specifically, at the internal level we employ a dynamic, learnable temperature to adaptively control the difficulty of multi-teacher knowledge; at the external level, a diverse aggregation strategy fuses the rich knowledge from multiple teachers. Optimizing the teacher outputs at both levels provides more comprehensive guidance for the student model and yields better distillation performance. Extensive experiments with various teacher and student architectures on multiple benchmark datasets show that the proposed approach outperforms other knowledge distillation methods. The code is available at https://github.com/JSJ515-Group/ATMKD.
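For orientation, the sketch below shows one way a learnable temperature and a per-sample weighted aggregation of teacher logits can be combined in a multi-teacher distillation loss. It is an illustration under assumed design choices (confidence-based teacher weighting and a KL distillation term), not the authors' ATMKD implementation; please refer to the linked repository for the actual code.

```python
# Minimal, illustrative multi-teacher KD loss with a learnable temperature.
# The weighting scheme and module names are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTeacherKDLoss(nn.Module):
    def __init__(self, init_temperature: float = 4.0):
        super().__init__()
        # Learnable (log-)temperature, optimized jointly with the student.
        self.log_t = nn.Parameter(torch.log(torch.tensor(init_temperature)))

    def forward(self, student_logits, teacher_logits_list, labels):
        t = self.log_t.exp()  # temperature stays positive
        # Weight each teacher per sample by its confidence on the true class
        # (one simple aggregation choice, not the paper's exact strategy).
        weights = torch.stack(
            [F.softmax(tl, dim=1).gather(1, labels.unsqueeze(1)).squeeze(1)
             for tl in teacher_logits_list], dim=0)           # (K, B)
        weights = weights / weights.sum(dim=0, keepdim=True)   # normalize over teachers

        log_p_s = F.log_softmax(student_logits / t, dim=1)
        kd = 0.0
        for w, tl in zip(weights, teacher_logits_list):
            p_t = F.softmax(tl / t, dim=1)
            kl = F.kl_div(log_p_s, p_t, reduction="none").sum(dim=1)  # per-sample KL
            kd = kd + (w * kl).mean()

        ce = F.cross_entropy(student_logits, labels)
        return ce + (t ** 2) * kd  # standard T^2 scaling of the distillation term
```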












Data availability
The data are available from the corresponding author on reasonable request. Three datasets are used in our experiments: CIFAR100, Tiny-ImageNet, and Stanford Dogs. The CIFAR100 dataset is available at https://www.cs.toronto.edu/~kriz/cifar.html, the Tiny-ImageNet dataset at http://cs231n.stanford.edu/tiny-imagenet-200.zip, and the Stanford Dogs dataset at http://vision.stanford.edu/aditya86/ImageNetDogs/.
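As a convenience, the snippet below shows one common way to obtain the CIFAR100 training and test splits via torchvision; the normalization statistics are widely used community values, and the paper's own data pipeline may differ.

```python
# Illustrative only: fetching CIFAR100 with torchvision.
import torchvision
import torchvision.transforms as T

transform = T.Compose([
    T.ToTensor(),
    T.Normalize((0.5071, 0.4865, 0.4409),   # commonly used CIFAR100 channel means
                (0.2673, 0.2564, 0.2762)),  # and standard deviations
])
train_set = torchvision.datasets.CIFAR100(root="./data", train=True,
                                          download=True, transform=transform)
test_set = torchvision.datasets.CIFAR100(root="./data", train=False,
                                         download=True, transform=transform)
```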
Acknowledgements
This research was supported in part by the Research Foundation of the Institute of Environment-friendly Materials and Occupational Health (Wuhu), Anhui University of Science and Technology (No. ALW2021YF04), in part by the Medical Special Cultivation Project of Anhui University of Science and Technology (No. YZ2023H2C005), and in part by the 2023 Graduate Innovation Fund Project of Anhui University of Science and Technology (No. 2023CX2136).
Author information
Contributions
Yu-e Lin and Xingzhu Liang agree to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved; Shuting Yin made substantial contributions to the conception and drafted the work; Yifeng Ding performed the experiments and analyzed the data. Shuting Yin and Yifeng Ding wrote the original manuscript. All the authors reviewed the manuscript.
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Ethical approval
This study was conducted with the highest regard for ethical standards and following relevant guidelines and regulations. The research protocol did not require ethical review or approval as it did not involve human participants, animals, or sensitive data. All data used in this study were obtained from publicly available sources and were properly cited and acknowledged. No private or personally identifiable information was used or accessed during this research. The authors declare that there are no conflicts of interest, financial or otherwise, that could potentially influence the objectivity or integrity of this study. While no ethical review or approval was necessary for this particular study, the principles of academic integrity and research ethics were strictly adhered to throughout the research process.
Additional information
Communicated by Bing-kun Bao.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Lin, Ye., Yin, S., Ding, Y. et al. ATMKD: adaptive temperature guided multi-teacher knowledge distillation. Multimedia Systems 30, 292 (2024). https://doi.org/10.1007/s00530-024-01483-w