Explanation Guided Knowledge Distillation for Pre-trained Language Model Compression

Published: 08 February 2024

Abstract

Knowledge distillation is widely used in pre-trained language model compression, as it can transfer knowledge from a cumbersome model to a lightweight one. Although knowledge distillation-based model compression has achieved promising performance, we observe that the explanations produced by the teacher model and the student model are often inconsistent. We argue that the student model should learn not only the predictions of the teacher model but also its internal reasoning process. To this end, we propose Explanation Guided Knowledge Distillation (EGKD) in this article, which uses explanations to represent that reasoning process and to improve knowledge distillation. To obtain explanations within our distillation framework, we select three typical explanation methods rooted in different mechanisms: gradient-based, perturbation-based, and feature selection methods. Then, to improve computational efficiency, we propose different optimization strategies for utilizing the explanations produced by each of the three methods, which provide the student model with better learning guidance. Experimental results on GLUE demonstrate that leveraging explanations improves the performance of the student model. Moreover, EGKD can also be applied to model compression across different architectures.
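To make the idea concrete, below is a minimal, hypothetical sketch (not the paper's actual implementation) of how a gradient-based explanation signal could be folded into a standard distillation objective: the usual hard-label and soft-label losses are combined with a term that pushes the student's token saliency toward the teacher's. The `gradient_saliency` helper, the loss weights `alpha` and `beta`, and the assumption that the models map token embeddings directly to class logits are all illustrative.

```python
# Hypothetical sketch of explanation-guided distillation with a gradient-based
# saliency signal (PyTorch). All names and weights are illustrative only.
import torch
import torch.nn.functional as F


def gradient_saliency(model, embeddings, labels, create_graph=False):
    """Token-level saliency: L2 norm of d(loss)/d(embeddings) for each token."""
    embeddings = embeddings.detach().requires_grad_(True)
    logits = model(embeddings)                        # (batch, num_classes)
    loss = F.cross_entropy(logits, labels)
    grads, = torch.autograd.grad(loss, embeddings, create_graph=create_graph)
    return grads.norm(dim=-1)                         # (batch, seq_len)


def egkd_loss(teacher, student, embeddings, labels,
              temperature=2.0, alpha=0.5, beta=0.1):
    """Hard-label CE + soft-label KD + explanation-consistency term."""
    with torch.no_grad():
        t_logits = teacher(embeddings)
    s_logits = student(embeddings)

    # Standard distillation: supervised loss plus softened teacher predictions.
    ce = F.cross_entropy(s_logits, labels)
    kd = F.kl_div(F.log_softmax(s_logits / temperature, dim=-1),
                  F.softmax(t_logits / temperature, dim=-1),
                  reduction="batchmean") * temperature ** 2

    # Explanation consistency: align normalized token-saliency distributions,
    # so the student attends to the tokens the teacher found important.
    t_sal = gradient_saliency(teacher, embeddings, labels).detach()
    s_sal = gradient_saliency(student, embeddings, labels, create_graph=True)
    expl = F.mse_loss(F.softmax(s_sal, dim=-1), F.softmax(t_sal, dim=-1))

    return ce + alpha * kd + beta * expl
```

In the same spirit, a perturbation-based or feature selection explainer could be swapped in for `gradient_saliency` without changing the rest of the objective.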

• Published in

  ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 23, Issue 2 (February 2024), 340 pages
  ISSN: 2375-4699
  EISSN: 2375-4702
  DOI: 10.1145/3613556
  • Editor: Imed Zitouni

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 8 February 2024
      • Online AM: 29 December 2023
      • Accepted: 14 December 2023
      • Received: 11 April 2023
Published in TALLIP Volume 23, Issue 2

      Qualifiers

      • research-article
• Article Metrics

  • Downloads (Last 12 months): 186
  • Downloads (Last 6 weeks): 55
