Abstract
The high computational cost of pre-trained language models creates a bottleneck for their deployment in production, and model compression has emerged as a critical technology for overcoming it. Knowledge distillation, a popular compression method, transfers knowledge from a large (teacher) model to a small (student) one. However, existing methods perform distillation on the entire dataset, which easily leads to repetitive learning for the student; furthermore, the capacity gap between the teacher and the student hinders knowledge transfer. To address these issues, we propose Data-efficient Knowledge Distillation (DeKD) with teacher assistant-based dynamic objective alignment, which enables the student to dynamically adjust its learning process. Specifically, at the data level we first design an entropy-based strategy to select informative instances, reducing redundant learning from instances the student has already mastered. At the model level, we then introduce a teacher assistant as an auxiliary model for the student to mitigate the degradation of distillation performance caused by the capacity gap. At the objective level, we further develop a mechanism that dynamically aligns the student with the teacher's intermediate representations to ensure effective knowledge transfer. Extensive experiments on benchmark datasets show that our method outperforms state-of-the-art methods.
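To make the data-level selection concrete, the sketch below shows one way an entropy-based instance filter could look in PyTorch: instances on which the student's predictive entropy is low are treated as already mastered and skipped, so distillation focuses on informative examples. This is a minimal illustration under our own assumptions; the function name entropy_select and the threshold tau are hypothetical and do not reflect the authors' exact criterion.

```python
import torch
import torch.nn.functional as F

def entropy_select(student_logits: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """Return a boolean mask over the batch keeping instances whose
    predictive entropy exceeds `tau` (illustrative threshold).

    High entropy: the student is still uncertain, so the instance is
    treated as informative. Low entropy: the instance is considered
    mastered and skipped to avoid repetitive learning.
    """
    probs = F.softmax(student_logits, dim=-1)                   # (batch, num_classes)
    entropy = -(probs * torch.log(probs + 1e-12)).sum(dim=-1)   # (batch,)
    return entropy > tau

# Hypothetical use inside a distillation loop:
#   mask = entropy_select(student(**batch).logits, tau=0.5)
#   informative_batch = {k: v[mask] for k, v in batch.items()}
#   ...then distill from the teacher (or teacher assistant) only on informative_batch.
```

The threshold could equally be set adaptively (e.g., a per-batch quantile of the entropy values); the fixed value here is only for illustration.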
Acknowledgement
This work is supported by the National Key Research and Development Program of China (No. 2023YFC3303800).
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Xu, Y. et al. (2024). Data-Efficient Knowledge Distillation with Teacher Assistant-Based Dynamic Objective Alignment. In: Franco, L., de Mulatier, C., Paszynski, M., Krzhizhanovskaya, V.V., Dongarra, J.J., Sloot, P.M.A. (eds) Computational Science – ICCS 2024. ICCS 2024. Lecture Notes in Computer Science, vol 14832. Springer, Cham. https://doi.org/10.1007/978-3-031-63749-0_13
DOI: https://doi.org/10.1007/978-3-031-63749-0_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-63748-3
Online ISBN: 978-3-031-63749-0