Abstract
Enhancing small language models for deployment in real-life applications is a significant challenge facing the research community. Because of the difficulties and costs of using large language models, researchers are seeking ways to deploy task-specific small models effectively. In this work, we introduce a simple yet effective knowledge distillation method to improve the performance of small language models. Our approach uses a teacher model of approximately 3 billion parameters to identify the tokens that are most influential in its decision-making process. These tokens are extracted from the input according to their attribution scores with respect to the output, using methods such as saliency maps. The important tokens are then provided as rationales to a student model, with the aim of distilling the teacher's knowledge. The method proves effective on four diverse datasets, where it outperforms both standard fine-tuning and state-of-the-art knowledge distillation models. Furthermore, we investigate why the method succeeds by analyzing the important tokens extracted from the teacher model. Our findings reveal that, on datasets where labels are part of the answer, such as multiple-choice questions, the extracted tokens are part of the ground truth in 68% of cases.
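To make the described pipeline concrete, the following is a minimal sketch of saliency-based important-token extraction of the kind outlined above, not the authors' exact implementation: the checkpoint name, prompt format, and top-k cutoff are illustrative assumptions. It scores each input token by the gradient norm of its embedding with respect to the teacher's loss on the target answer and returns the highest-scoring tokens, which would then be appended to the student's training input as a rationale.

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# An ~3B-parameter teacher; the exact checkpoint used in the paper is an assumption here.
TEACHER = "google/flan-t5-xl"
tokenizer = AutoTokenizer.from_pretrained(TEACHER)
teacher = AutoModelForSeq2SeqLM.from_pretrained(TEACHER)
teacher.eval()

def important_tokens(question: str, answer: str, top_k: int = 5) -> list[str]:
    """Rank input tokens by gradient saliency w.r.t. the teacher's target answer."""
    enc = tokenizer(question, return_tensors="pt")
    labels = tokenizer(answer, return_tensors="pt").input_ids

    # Embed the input ourselves so gradients can be taken w.r.t. the embeddings.
    embeds = teacher.get_input_embeddings()(enc.input_ids).detach()
    embeds.requires_grad_(True)

    loss = teacher(inputs_embeds=embeds,
                   attention_mask=enc.attention_mask,
                   labels=labels).loss
    loss.backward()

    # Saliency score of a token = L2 norm of the gradient of its embedding.
    scores = embeds.grad.norm(dim=-1).squeeze(0)
    top = scores.topk(min(top_k, scores.numel())).indices.tolist()
    return [tokenizer.decode([enc.input_ids[0, i].item()]) for i in sorted(top)]

# The student would then be fine-tuned on inputs augmented with these tokens as a
# rationale, e.g. "question: ... important words: <tok_1>, ..., <tok_k> answer: ...".
print(important_tokens("Where would you store a jar of jam?", "pantry"))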
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Ballout, M., Krumnack, U., Heidemann, G., Kühnberger, KU. (2024). Efficient Knowledge Distillation: Empowering Small Language Models with Teacher Model Insights. In: Rapp, A., Di Caro, L., Meziane, F., Sugumaran, V. (eds) Natural Language Processing and Information Systems. NLDB 2024. Lecture Notes in Computer Science, vol 14762. Springer, Cham. https://doi.org/10.1007/978-3-031-70239-6_3
DOI: https://doi.org/10.1007/978-3-031-70239-6_3
Print ISBN: 978-3-031-70238-9
Online ISBN: 978-3-031-70239-6