
Data-Efficient Knowledge Distillation with Teacher Assistant-Based Dynamic Objective Alignment

  • Conference paper
  • First Online:
Computational Science – ICCS 2024 (ICCS 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14832)


Abstract

Pre-trained language models encounter a bottleneck in production due to their high computational cost. Model compression methods have emerged as critical technologies for overcoming this bottleneck. As a popular compression method, knowledge distillation transfers knowledge from a large (teacher) model to a small (student) one. However, existing methods perform distillation on the entire dataset, which easily leads to repetitive learning for the student. Furthermore, the capacity gap between the teacher and the student hinders knowledge transfer. To address these issues, we propose Data-efficient Knowledge Distillation (DeKD) with teacher assistant-based dynamic objective alignment, which empowers the student to dynamically adjust the learning process. Specifically, we first design an entropy-based strategy to select informative instances at the data level, which reduces redundant learning from instances the student has already mastered. Next, we introduce a teacher assistant as an auxiliary model for the student at the model level to mitigate the degradation of distillation performance. Finally, we develop a mechanism that dynamically aligns the intermediate representations of the teacher to ensure effective knowledge transfer at the objective level. Extensive experiments on benchmark datasets show that our method outperforms state-of-the-art methods.
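
The abstract outlines three components: entropy-based instance selection at the data level, a teacher assistant at the model level, and alignment to the teacher's intermediate representations at the objective level. The PyTorch sketch below illustrates one way these pieces could fit together; it is a minimal illustration based only on the abstract, not the authors' implementation, and every name and hyperparameter (select_informative_instances, distillation_step, keep_ratio, temperature, align_weight) is a hypothetical placeholder. The dynamic adjustment of the alignment objective described in the paper is not reproduced here.

    import torch
    import torch.nn.functional as F

    def select_informative_instances(student_logits, keep_ratio=0.5):
        # Entropy-based selection at the data level: keep the instances the
        # student is least certain about, so distillation skips examples the
        # student has already mastered.
        probs = F.softmax(student_logits, dim=-1)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)  # [batch]
        k = max(1, int(keep_ratio * entropy.size(0)))
        return entropy.topk(k).indices  # indices of the most informative instances

    def distillation_step(student_logits, assistant_logits, labels,
                          student_hidden, teacher_hidden,
                          temperature=2.0, align_weight=1.0):
        # Model level: distill soft targets from a mid-sized teacher assistant.
        # Objective level: align the student's intermediate representation with
        # the teacher's (assumes matching hidden sizes; a linear projection
        # would otherwise be needed).
        ce = F.cross_entropy(student_logits, labels)
        kd = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                      F.softmax(assistant_logits / temperature, dim=-1),
                      reduction="batchmean") * (temperature ** 2)
        align = F.mse_loss(student_hidden, teacher_hidden)
        return ce + kd + align_weight * align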



Acknowledgement

This work is supported by the National Key Research and Development Program of China (No. 2023YFC3303800).

Author information


Corresponding authors

Correspondence to Fangfang Yuan or Rongxin Mi.



Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Xu, Y. et al. (2024). Data-Efficient Knowledge Distillation with Teacher Assistant-Based Dynamic Objective Alignment. In: Franco, L., de Mulatier, C., Paszynski, M., Krzhizhanovskaya, V.V., Dongarra, J.J., Sloot, P.M.A. (eds) Computational Science – ICCS 2024. ICCS 2024. Lecture Notes in Computer Science, vol 14832. Springer, Cham. https://doi.org/10.1007/978-3-031-63749-0_13


  • DOI: https://doi.org/10.1007/978-3-031-63749-0_13

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-63748-3

  • Online ISBN: 978-3-031-63749-0

  • eBook Packages: Computer Science, Computer Science (R0)
