
Accelerating Pretrained Language Model Inference Using Weighted Ensemble Self-distillation

  • Conference paper
  • First Online:

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13028)

Abstract

Pretrained language models (PLMs) have achieved remarkable results in various natural language processing tasks. However, gains in performance are accompanied by greater computational consumption and longer inference time, which makes deploying PLMs on edge devices for low-latency applications challenging. To address this issue, recent studies have recommended applying either model compression or early-exiting techniques to accelerate inference. However, model compression permanently discards modules of the model, leading to a decline in performance. Existing early-exiting strategies train the PLM backbone and the early-exiting classifiers separately, which not only incurs extra training cost but also loses semantic information from higher layers, resulting in unreliable decisions by the early-exiting classifiers. In this study, a weighted ensemble self-distillation method is proposed to improve the early-exiting strategy and balance performance against inference time. It enables early-exiting classifiers to obtain rich semantic information from different layers through an attention mechanism that weights each layer according to its contribution to the final prediction. Furthermore, it performs weighted ensemble self-distillation and fine-tuning of the PLM backbone simultaneously, so that the PLM is fine-tuned during the training of the early-exiting classifiers and its performance is preserved as much as possible. The experimental results show that the proposed model accelerates inference at a minimal cost in performance, outperforming previous early-exiting models. The code is available at: https://github.com/JunKong5/WestBERT.
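To make the two ideas in the abstract concrete, the following minimal PyTorch sketch illustrates an attention mechanism that weights each layer's exit classifier by its contribution to an ensemble prediction, and a self-distillation loss in which every exit learns from that weighted ensemble while the backbone is fine-tuned jointly. This is not the authors' released WestBERT code; the module names, the entropy-based exit criterion, and all hyperparameters (hidden size, layer count, temperature, threshold) are illustrative assumptions.

# Minimal sketch (not the authors' released WestBERT code) of attention-weighted
# ensemble self-distillation for early exiting. Module names, shapes and
# hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class EarlyExitHead(nn.Module):
    """Per-layer exit classifiers plus an attention over layers."""

    def __init__(self, hidden_size=768, num_layers=12, num_labels=2):
        super().__init__()
        # One lightweight exit classifier per encoder layer.
        self.exits = nn.ModuleList(
            nn.Linear(hidden_size, num_labels) for _ in range(num_layers)
        )
        # Scores each layer's contribution to the ensemble prediction.
        self.layer_attn = nn.Linear(hidden_size, 1)

    def forward(self, layer_states):
        # layer_states: (num_layers, batch, hidden) pooled [CLS] vectors,
        # one per encoder layer of the PLM backbone.
        logits_per_layer = torch.stack(
            [clf(h) for clf, h in zip(self.exits, layer_states)]
        )                                                              # (L, B, C)
        weights = torch.softmax(self.layer_attn(layer_states), dim=0)  # (L, B, 1)
        ensemble_logits = (weights * logits_per_layer).sum(dim=0)      # (B, C)
        return logits_per_layer, ensemble_logits


def training_loss(logits_per_layer, ensemble_logits, labels, temperature=2.0):
    # Supervised loss on the weighted ensemble; gradients flow back into the
    # backbone, so fine-tuning and classifier training happen jointly.
    ce = F.cross_entropy(ensemble_logits, labels)
    # Self-distillation: each exit mimics the weighted-ensemble "teacher".
    teacher = F.softmax(ensemble_logits.detach() / temperature, dim=-1)
    kd = sum(
        F.kl_div(F.log_softmax(l / temperature, dim=-1), teacher,
                 reduction="batchmean")
        for l in logits_per_layer
    ) / logits_per_layer.size(0)
    return ce + (temperature ** 2) * kd


@torch.no_grad()
def early_exit_predict(layer_states, head, entropy_threshold=0.3):
    # Inference for a single example: stop at the first exit whose prediction
    # entropy falls below the threshold; otherwise fall back to the last layer.
    for clf, h in zip(head.exits, layer_states):
        probs = F.softmax(clf(h), dim=-1)
        entropy = -(probs * probs.log()).sum(dim=-1)
        if entropy.item() < entropy_threshold:  # assumes batch size 1
            return probs.argmax(dim=-1)
    return probs.argmax(dim=-1)

In a real deployment the encoder would be run layer by layer so that higher layers are only computed when the current exit is not confident enough; the sketch keeps all hidden states in memory purely for brevity.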

Acknowledgements

This work was supported by the National Natural Science Foundation of China (NSFC) under Grants Nos. 61702443, 61966038 and 61762091. The authors would like to thank the anonymous reviewers for their constructive comments.

Author information

Corresponding author

Correspondence to Jin Wang.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Kong, J., Wang, J., Zhang, X. (2021). Accelerating Pretrained Language Model Inference Using Weighted Ensemble Self-distillation. In: Wang, L., Feng, Y., Hong, Y., He, R. (eds) Natural Language Processing and Chinese Computing. NLPCC 2021. Lecture Notes in Computer Science (LNAI), vol 13028. Springer, Cham. https://doi.org/10.1007/978-3-030-88480-2_18

  • DOI: https://doi.org/10.1007/978-3-030-88480-2_18

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-88479-6

  • Online ISBN: 978-3-030-88480-2

  • eBook Packages: Computer Science (R0)
