
LightBERT: A Distilled Chinese BERT Model

  • Conference paper
  • In: Artificial Intelligence and Mobile Services – AIMS 2021 (AIMS 2021)
  • Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12987)


Abstract

Pre-trained language models (e.g. BERT) have achieved remarkable performance on most natural language understanding tasks. However, it is difficult to deploy these models in online systems because of their huge number of parameters and long inference time. Knowledge distillation is a popular model compression technique that can substantially shrink a model's structure with limited performance degradation. However, there are currently no knowledge distillation methods designed specifically for compressing Chinese pre-trained language models, and no corresponding distilled model has been publicly released. In this paper, we propose LightBERT, a distilled BERT model designed specifically for Chinese language processing. We perform pre-training distillation under the masked language model objective with whole word masking, a masking strategy adapted to the characteristics of the Chinese language. Furthermore, we adopt a multi-step distillation strategy to compress the model progressively. Experiments on the CLUE benchmark show that LightBERT reduces the size of a RoBERTa model by 62.5% while retaining 94.5% of its teacher's performance.
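
For readers skimming the abstract, the sketch below illustrates the two ingredients it names: whole word masking over a word-segmented Chinese sentence, and a soft-label distillation loss between teacher and student predictions under the masked language model objective. This is a minimal illustrative sketch, not the authors' released code; the function names (`whole_word_mask`, `distillation_loss`), the temperature value, and the vocabulary size are assumptions made for the example.

```python
# Illustrative sketch only -- assumes word-segmented Chinese input and
# teacher/student vocabulary logits at the masked positions.
import random

import torch
import torch.nn.functional as F

MASK = "[MASK]"


def whole_word_mask(words, mask_prob=0.15):
    """Mask whole Chinese words (all of their characters together).

    `words` is a word-segmented sentence, e.g. ["预训练", "语言", "模型"].
    Returns character-level tokens plus the characters to predict
    (None marks positions that are not prediction targets).
    """
    tokens, targets = [], []
    for word in words:
        chars = list(word)
        if random.random() < mask_prob:
            tokens.extend([MASK] * len(chars))   # mask every character of the word
            targets.extend(chars)                # the model must recover each character
        else:
            tokens.extend(chars)
            targets.extend([None] * len(chars))
    return tokens, targets


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-label knowledge distillation loss (Hinton et al.): KL divergence
    between temperature-softened teacher and student output distributions."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2


if __name__ == "__main__":
    tokens, targets = whole_word_mask(["预训练", "语言", "模型"], mask_prob=0.5)
    print(tokens, targets)

    vocab_size = 21128  # vocabulary size of the standard Chinese BERT tokenizer
    student = torch.randn(4, vocab_size)   # logits at 4 masked positions
    teacher = torch.randn(4, vocab_size)
    print(distillation_loss(student, teacher).item())
```

The multi-step strategy described in the abstract would apply a loss of this form in stages, distilling the teacher into an intermediate-sized model before producing the final compact student; the exact schedule and intermediate model sizes are not reproduced here.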


Notes

  1. https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2.

  2. https://github.com/ymcui/Chinese-BERT-wwm.

References

  1. Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization. Stat 1050, 21 (2016)

  2. Cui, Y., et al.: Pre-training with whole word masking for Chinese BERT. arXiv preprint arXiv:1906.08101 (2019)

  3. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186 (2019)

  4. Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)

  5. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)

  6. Jiao, X., et al.: TinyBERT: distilling BERT for natural language understanding. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pp. 4163–4174 (2020)

  7. Kovaleva, O., Romanov, A., Rogers, A., Rumshisky, A.: Revealing the dark secrets of BERT. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 4365–4374 (2019)

  8. Kullback, S.: Information Theory and Statistics. Courier Corporation (1997)

  9. Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., Soricut, R.: ALBERT: a lite BERT for self-supervised learning of language representations. In: International Conference on Learning Representations (2019)

  10. Li, X., Yan, H., Qiu, X., Huang, X.J.: FLAT: Chinese NER using flat-lattice transformer. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 6836–6842 (2020)

  11. Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019)

  12. Michel, P., Levy, O., Neubig, G.: Are sixteen heads really better than one? Adv. Neural Inf. Process. Syst. 32, 14014–14024 (2019)

  13. Mirzadeh, S.I., Farajtabar, M., Li, A., Levine, N., Matsukawa, A., Ghasemzadeh, H.: Improved knowledge distillation via teacher assistant. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 5191–5198 (2020)

  14. Romero, A., Ballas, N., Kahou, S., Chassang, A., Gatta, C., Bengio, Y.: FitNets: hints for thin deep nets. CoRR abs/1412.6550 (2015)

  15. Sanh, V., Debut, L., Chaumond, J., Wolf, T.: DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108 (2019)

  16. Sun, S., Cheng, Y., Gan, Z., Liu, J.: Patient knowledge distillation for BERT model compression. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 4323–4332 (2019)

  17. Sun, Z., Yu, H., Song, X., Liu, R., Yang, Y., Zhou, D.: MobileBERT: a compact task-agnostic BERT for resource-limited devices. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 2158–2170 (2020)

  18. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)

  19. Voita, E., Talbot, D., Moiseev, F., Sennrich, R., Titov, I.: Analyzing multi-head self-attention: specialized heads do the heavy lifting, the rest can be pruned. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (2019)

  20. Wang, Q., Li, B., Xiao, T., Zhu, J., Li, C., Wong, D.F., Chao, L.S.: Learning deep transformer models for machine translation. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 1810–1822 (2019)

  21. Wu, N., Green, B., Ben, X., O’Banion, S.: Deep transformer models for time series forecasting: the influenza prevalence case. arXiv preprint arXiv:2001.08317 (2020)

  22. Xu, L., et al.: CLUE: a Chinese language understanding evaluation benchmark. In: Proceedings of the 28th International Conference on Computational Linguistics, pp. 4762–4772 (2020)


Acknowledgement

This work was partially supported by the National Natural Science Foundation of China (61632011, 61876053, 62006062), the Shenzhen Foundational Research Funding (JCYJ20180507183527919), the China Postdoctoral Science Foundation (2020M670912), and the Joint Lab of HITSZ and China Merchants Securities.

Author information


Corresponding author

Correspondence to Ruifeng Xu.


Copyright information

© 2022 Springer Nature Switzerland AG

About this paper


Cite this paper

Zhang, Y. et al. (2022). LightBERT: A Distilled Chinese BERT Model. In: Pan, Y., Mao, ZH., Luo, L., Zeng, J., Zhang, LJ. (eds) Artificial Intelligence and Mobile Services – AIMS 2021. AIMS 2021. Lecture Notes in Computer Science, vol 12987. Springer, Cham. https://doi.org/10.1007/978-3-030-96033-9_5


  • DOI: https://doi.org/10.1007/978-3-030-96033-9_5


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-96032-2

  • Online ISBN: 978-3-030-96033-9

  • eBook Packages: Computer Science (R0)
