Abstract
Knowledge distillation is widely used to compress pre-trained language models by transferring knowledge from a cumbersome teacher model to a lightweight student model. Although knowledge distillation based compression has achieved promising performance, we observe that the explanations produced by the teacher model and the student model are often inconsistent. We argue that the student model should learn not only the teacher model's predictions but also its internal reasoning process. To this end, we propose Explanation Guided Knowledge Distillation (EGKD) in this article, which utilizes explanations to represent that reasoning process and to improve knowledge distillation. To obtain explanations within our distillation framework, we select three representative explanation methods rooted in different mechanisms, namely gradient-based, perturbation-based, and feature-selection methods. Then, to improve computational efficiency, we propose tailored optimization strategies for exploiting the explanations produced by each of these methods, which provide the student model with better learning guidance. Experimental results on GLUE demonstrate that leveraging explanations improves the performance of the student model. Moreover, EGKD can also be applied to model compression with different architectures.
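To make the idea concrete, the sketch below shows one way an explanation-alignment term can be added to a standard distillation objective, using the gradient-based variant (token saliency as the gradient norm with respect to input embeddings) as a stand-in for the three explanation methods. This is a minimal illustration under stated assumptions, not the paper's released implementation: it assumes PyTorch models that map input embeddings to logits and share a tokenization, and all names and loss weights are hypothetical.

```python
# Minimal sketch of explanation-guided distillation (illustrative, not the
# authors' code). Assumes `teacher(embeds)` and `student(embeds)` return
# classification logits, and both models see the same token sequence.
import torch
import torch.nn.functional as F

def saliency(model, embeds, labels, create_graph=False):
    """Gradient-based explanation: L1-normalized norm of d(loss)/d(embedding)."""
    embeds = embeds.detach().requires_grad_(True)
    loss = F.cross_entropy(model(embeds), labels)
    (grad,) = torch.autograd.grad(loss, embeds, create_graph=create_graph)
    scores = grad.norm(dim=-1)               # (batch, seq_len)
    return F.normalize(scores, p=1, dim=-1)  # per-example saliency distribution

def egkd_loss(teacher, student, t_embeds, s_embeds, labels,
              T=2.0, alpha=0.5, beta=0.1):   # hyperparameters are assumptions
    with torch.no_grad():
        t_logits = teacher(t_embeds)
    s_logits = student(s_embeds)
    # Standard soft-label distillation (Hinton et al., 2015).
    kd = F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                  F.softmax(t_logits / T, dim=-1),
                  reduction="batchmean") * T * T
    # Hard-label supervision.
    ce = F.cross_entropy(s_logits, labels)
    # Explanation alignment: pull the student's saliency toward the teacher's.
    # create_graph=True lets this term backpropagate into student parameters.
    expl = F.mse_loss(saliency(student, s_embeds, labels, create_graph=True),
                      saliency(teacher, t_embeds, labels).detach())
    return alpha * kd + (1 - alpha) * ce + beta * expl
```

Note that aligning gradient-based saliencies requires a double-backward pass through the student, which is one reason efficiency-oriented optimization strategies matter; perturbation-based or feature-selection explanations would swap in a different `saliency` routine.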