MS-BERT: A Multi-layer Self-distillation Approach for BERT Compression Based on Earth Mover’s Distance

  • Conference paper

Abstract

Over the past three years, pre-trained language models have been widely adopted across natural language processing tasks and have driven significant progress. However, their high computational cost severely limits inference efficiency and hinders their deployment in resource-constrained industries. To improve efficiency while preserving accuracy, we propose MS-BERT, a multi-layer self-distillation approach for BERT compression based on Earth Mover's Distance (EMD), with the following features: (1) MS-BERT lets the lightweight network (student) learn from all layers of the large model (teacher), so the student can absorb different levels of knowledge from the teacher and achieve better performance. (2) EMD is introduced to measure the distance between teacher layers and student layers, enabling multi-layer knowledge transfer from teacher to student. (3) Two design strategies for the student layers and a top-K uncertainty calculation method are proposed to further improve MS-BERT's performance. Extensive experiments on different datasets show that our model can run 2 to 12 times faster than BERT under different levels of accuracy loss.
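
The abstract's central mechanism is an EMD-based, many-to-many matching between teacher and student layers. The sketch below shows one plausible way such a layer-transfer cost could be computed, assuming uniform layer weights and mean-squared error between pooled hidden states as the ground distance; the function names (layer_cost_matrix, emd_transfer_loss) and the use of scipy.optimize.linprog are illustrative assumptions, not the paper's released implementation.

    # Minimal sketch of an EMD (optimal-transport) layer-matching cost between
    # a teacher and a student, under the assumptions stated above.
    import numpy as np
    from scipy.optimize import linprog

    def layer_cost_matrix(teacher_states, student_states):
        """Ground-distance matrix: MSE between each (pooled) teacher-layer and
        student-layer representation. Shape (M teacher layers, N student layers)."""
        C = np.zeros((len(teacher_states), len(student_states)))
        for i, t in enumerate(teacher_states):
            for j, s in enumerate(student_states):
                C[i, j] = np.mean((t - s) ** 2)
        return C

    def emd_transfer_loss(C, w_teacher=None, w_student=None):
        """Solve the transportation problem: minimize <F, C> over flows F >= 0
        whose row sums match the teacher layer weights and whose column sums
        match the student layer weights. The optimal cost is the EMD."""
        M, N = C.shape
        w_teacher = np.full(M, 1.0 / M) if w_teacher is None else w_teacher
        w_student = np.full(N, 1.0 / N) if w_student is None else w_student

        # Marginal (equality) constraints on the row-major flattened flow matrix.
        A_eq = np.zeros((M + N, M * N))
        for i in range(M):                  # row sums = teacher layer weights
            A_eq[i, i * N:(i + 1) * N] = 1.0
        for j in range(N):                  # column sums = student layer weights
            A_eq[M + j, j::N] = 1.0
        b_eq = np.concatenate([w_teacher, w_student])

        res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq,
                      bounds=[(0, None)] * (M * N), method="highs")
        return res.fun  # EMD value, usable as a layer-transfer loss term

    # Toy usage: a 12-layer teacher matched against a 4-layer student.
    rng = np.random.default_rng(0)
    teacher = [rng.standard_normal(768) for _ in range(12)]
    student = [rng.standard_normal(768) for _ in range(4)]
    print(emd_transfer_loss(layer_cost_matrix(teacher, student)))

Under these assumptions, the optimal flow spreads each student layer's supervision across several teacher layers, which is what allows a shallow student to draw on knowledge from every teacher layer rather than from a fixed one-to-one mapping.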

Notes

  1. https://github.com/google-research/bert.

Acknowledgements

This research was partially sponsored by the following funds: National Key R&D Program of China (2018YFB1402800), the Fundamental Research Funds for the Provincial Universities of Zhejiang (RF-A2020007) and Zhejiang Lab (2020AA3AB05).

Author information

Corresponding author

Correspondence to Bin Cao.

Copyright information

© 2021 ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering

About this paper

Cite this paper

Huang, J., Cao, B., Wang, J., Fan, J. (2021). MS-BERT: A Multi-layer Self-distillation Approach for BERT Compression Based on Earth Mover’s Distance. In: Gao, H., Wang, X. (eds) Collaborative Computing: Networking, Applications and Worksharing. CollaborateCom 2021. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 407. Springer, Cham. https://doi.org/10.1007/978-3-030-92638-0_19

  • DOI: https://doi.org/10.1007/978-3-030-92638-0_19

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-92637-3

  • Online ISBN: 978-3-030-92638-0

  • eBook Packages: Computer Science, Computer Science (R0)
