MS-BERT: A Multi-layer Self-distillation Approach for BERT Compression Based on Earth Mover’s Distance

  • Conference paper

Abstract

Over the past three years, pre-trained language models have been widely adopted across natural language processing tasks and have driven significant progress. However, their high computational cost severely limits inference efficiency and hinders their deployment in resource-constrained industries. To improve efficiency while preserving accuracy, we propose MS-BERT, a multi-layer self-distillation approach for BERT compression based on Earth Mover's Distance (EMD), with the following features: (1) MS-BERT lets the lightweight network (student) learn from all layers of the large model (teacher), so the student can absorb different levels of knowledge from the teacher and achieve better performance. (2) EMD is introduced to measure the distance between teacher layers and student layers, enabling multi-layer knowledge transfer from teacher to student. (3) Two design strategies for the student layers and a top-K uncertainty calculation method are proposed to further improve MS-BERT's performance. Extensive experiments on different datasets show that our model can run 2 to 12 times faster than BERT under different levels of accuracy loss.
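
The abstract's central mechanism is an EMD-based, many-to-many matching between teacher and student layers. The sketch below shows one plausible way such a layer-transfer cost could be computed, assuming uniform layer weights and mean-squared error between pooled hidden states as the ground distance; the function names (layer_cost_matrix, emd_transfer_loss) and the use of scipy.optimize.linprog are illustrative assumptions, not the paper's released implementation.

    # Minimal sketch of an EMD (optimal-transport) layer-matching cost between
    # a teacher and a student, under the assumptions stated above.
    import numpy as np
    from scipy.optimize import linprog

    def layer_cost_matrix(teacher_states, student_states):
        """Ground-distance matrix: MSE between each (pooled) teacher-layer and
        student-layer representation. Shape (M teacher layers, N student layers)."""
        C = np.zeros((len(teacher_states), len(student_states)))
        for i, t in enumerate(teacher_states):
            for j, s in enumerate(student_states):
                C[i, j] = np.mean((t - s) ** 2)
        return C

    def emd_transfer_loss(C, w_teacher=None, w_student=None):
        """Solve the transportation problem: minimize <F, C> over flows F >= 0
        whose row sums match the teacher layer weights and whose column sums
        match the student layer weights. The optimal cost is the EMD."""
        M, N = C.shape
        w_teacher = np.full(M, 1.0 / M) if w_teacher is None else w_teacher
        w_student = np.full(N, 1.0 / N) if w_student is None else w_student

        # Marginal (equality) constraints on the row-major flattened flow matrix.
        A_eq = np.zeros((M + N, M * N))
        for i in range(M):                  # row sums = teacher layer weights
            A_eq[i, i * N:(i + 1) * N] = 1.0
        for j in range(N):                  # column sums = student layer weights
            A_eq[M + j, j::N] = 1.0
        b_eq = np.concatenate([w_teacher, w_student])

        res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq,
                      bounds=[(0, None)] * (M * N), method="highs")
        return res.fun  # EMD value, usable as a layer-transfer loss term

    # Toy usage: a 12-layer teacher matched against a 4-layer student.
    rng = np.random.default_rng(0)
    teacher = [rng.standard_normal(768) for _ in range(12)]
    student = [rng.standard_normal(768) for _ in range(4)]
    print(emd_transfer_loss(layer_cost_matrix(teacher, student)))

Under these assumptions, the optimal flow spreads each student layer's supervision across several teacher layers, which is what allows a shallow student to draw on knowledge from every teacher layer rather than from a fixed one-to-one mapping.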

Notes

  1. https://github.com/google-research/bert.

Acknowledgements

This research was partially sponsored by the following funds: National Key R&D Program of China (2018YFB1402800), the Fundamental Research Funds for the Provincial Universities of Zhejiang (RF-A2020007) and Zhejiang Lab (2020AA3AB05).

Author information

Corresponding author

Correspondence to Bin Cao.

Copyright information

© 2021 ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering

About this paper

Cite this paper

Huang, J., Cao, B., Wang, J., Fan, J. (2021). MS-BERT: A Multi-layer Self-distillation Approach for BERT Compression Based on Earth Mover’s Distance. In: Gao, H., Wang, X. (eds) Collaborative Computing: Networking, Applications and Worksharing. CollaborateCom 2021. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 407. Springer, Cham. https://doi.org/10.1007/978-3-030-92638-0_19

  • DOI: https://doi.org/10.1007/978-3-030-92638-0_19

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-92637-3

  • Online ISBN: 978-3-030-92638-0

  • eBook Packages: Computer Science, Computer Science (R0)
