Abstract
The high computational cost of pre-trained language models creates a bottleneck for their deployment in production, and model compression has emerged as a critical technology for overcoming it. Knowledge distillation, a popular compression method, transfers knowledge from a large (teacher) model to a small (student) one. However, existing methods perform distillation on the entire dataset, which easily leads to repetitive learning for the student; furthermore, the capacity gap between the teacher and the student hinders knowledge transfer. To address these issues, we propose Data-efficient Knowledge Distillation (DeKD) with teacher assistant-based dynamic objective alignment, which enables the student to dynamically adjust its learning process. Specifically, at the data level we first design an entropy-based strategy to select informative instances, reducing redundant learning from instances the student has already mastered. At the model level, we then introduce a teacher assistant as an auxiliary model for the student to mitigate the degradation of distillation performance caused by the capacity gap. At the objective level, we further develop a mechanism that dynamically aligns the student with the teacher's intermediate representations to ensure effective knowledge transfer. Extensive experiments on benchmark datasets show that our method outperforms state-of-the-art methods.
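To make the data-level selection concrete, the sketch below shows one way an entropy-based instance filter could look in PyTorch: instances on which the student's predictive entropy is low are treated as already mastered and skipped, so distillation focuses on informative examples. This is a minimal illustration under our own assumptions; the function name entropy_select and the threshold tau are hypothetical and do not reflect the authors' exact criterion.

```python
import torch
import torch.nn.functional as F

def entropy_select(student_logits: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """Return a boolean mask over the batch keeping instances whose
    predictive entropy exceeds `tau` (illustrative threshold).

    High entropy: the student is still uncertain, so the instance is
    treated as informative. Low entropy: the instance is considered
    mastered and skipped to avoid repetitive learning.
    """
    probs = F.softmax(student_logits, dim=-1)                   # (batch, num_classes)
    entropy = -(probs * torch.log(probs + 1e-12)).sum(dim=-1)   # (batch,)
    return entropy > tau

# Hypothetical use inside a distillation loop:
#   mask = entropy_select(student(**batch).logits, tau=0.5)
#   informative_batch = {k: v[mask] for k, v in batch.items()}
#   ...then distill from the teacher (or teacher assistant) only on informative_batch.
```

The threshold could equally be set adaptively (e.g., a per-batch quantile of the entropy values); the fixed value here is only for illustration.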
Acknowledgement
This work is supported by the National Key Research and Development Program of China (No. 2023YFC3303800).
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Xu, Y. et al. (2024). Data-Efficient Knowledge Distillation with Teacher Assistant-Based Dynamic Objective Alignment. In: Franco, L., de Mulatier, C., Paszynski, M., Krzhizhanovskaya, V.V., Dongarra, J.J., Sloot, P.M.A. (eds) Computational Science – ICCS 2024. ICCS 2024. Lecture Notes in Computer Science, vol 14832. Springer, Cham. https://doi.org/10.1007/978-3-031-63749-0_13
DOI: https://doi.org/10.1007/978-3-031-63749-0_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-63748-3
Online ISBN: 978-3-031-63749-0