
Efficient Knowledge Distillation: Empowering Small Language Models with Teacher Model Insights

  • Conference paper

In: Natural Language Processing and Information Systems (NLDB 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14762)

Abstract

Enhancing small language models for deployment in real-life applications is a significant challenge facing the research community. Because of the difficulty and cost of using large language models, researchers are seeking ways to deploy task-specific small models effectively. In this work, we introduce a simple yet effective knowledge distillation method to improve the performance of small language models. Our approach uses a teacher model with approximately 3 billion parameters to identify the tokens that are most influential in its decision-making process. These tokens are extracted from the input according to their attribution scores with respect to the output, computed with methods such as saliency maps. The important tokens are then provided as rationales to a student model, with the aim of distilling the teacher model's knowledge. The method proves effective on four diverse datasets, where it improves over both standard fine-tuning and state-of-the-art knowledge distillation models. Furthermore, we investigate why the method succeeds by analyzing the important tokens extracted from the teacher model. Our findings reveal that in 68% of cases, specifically in datasets where the label is part of the answer, such as multiple-choice questions, the extracted tokens are part of the ground truth.
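To make the token-extraction step described in the abstract concrete, the sketch below shows one way to compute gradient-based saliency scores for the input tokens of a seq2seq teacher and return the highest-scoring tokens as a rationale for the student. This is a minimal illustration, not the authors' implementation: the teacher checkpoint (google/flan-t5-xl), the top_k value, the gradient-norm attribution choice, and the student prompt format are assumptions made for this example.

```python
# Minimal sketch (assumptions noted above), not the authors' code:
# extract the most influential input tokens of a teacher model via a
# gradient-based saliency map, then append them to the student's input.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

teacher_name = "google/flan-t5-xl"  # an ~3B-parameter seq2seq teacher (assumed)
tokenizer = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModelForSeq2SeqLM.from_pretrained(teacher_name).eval()

def important_tokens(question: str, answer: str, top_k: int = 5):
    """Return the top_k input tokens with the highest saliency w.r.t. the answer."""
    enc = tokenizer(question, return_tensors="pt")
    labels = tokenizer(answer, return_tensors="pt").input_ids

    # Embed the input manually so gradients can be taken w.r.t. the embeddings.
    embeds = teacher.get_input_embeddings()(enc.input_ids).detach().requires_grad_(True)
    out = teacher(inputs_embeds=embeds, attention_mask=enc.attention_mask, labels=labels)
    out.loss.backward()

    # Saliency score per input token: L2 norm of the gradient of the loss with
    # respect to that token's embedding (one common attribution choice).
    scores = embeds.grad.norm(dim=-1).squeeze(0)
    top_ids = scores.topk(min(top_k, scores.numel())).indices.tolist()
    tokens = tokenizer.convert_ids_to_tokens(enc.input_ids.squeeze(0).tolist())
    return [tokens[i] for i in sorted(top_ids)]

# The student would then be fine-tuned on inputs augmented with these tokens, e.g.
# f"{question} Important words: {', '.join(important_tokens(question, answer))}"
```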



Author information

Corresponding author

Correspondence to Mohamad Ballout.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Ballout, M., Krumnack, U., Heidemann, G., Kühnberger, KU. (2024). Efficient Knowledge Distillation: Empowering Small Language Models with Teacher Model Insights. In: Rapp, A., Di Caro, L., Meziane, F., Sugumaran, V. (eds) Natural Language Processing and Information Systems. NLDB 2024. Lecture Notes in Computer Science, vol 14762. Springer, Cham. https://doi.org/10.1007/978-3-031-70239-6_3

  • DOI: https://doi.org/10.1007/978-3-031-70239-6_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-70238-9

  • Online ISBN: 978-3-031-70239-6

  • eBook Packages: Computer Science (R0)
