
SCBERT: Single Channel BERT for Chinese Spelling Correction

  • Conference paper
Web and Big Data (APWeb-WAIM 2022)

Abstract

Chinese spelling correction (CSC) and the BERT pre-training task can both be regarded as text denoising. In this work, to further narrow the gap between pre-training and CSC, we present Single Channel BERT (SCBERT), which incorporates the semantics, pinyin, and glyph of typos to provide effective spelling correction. During pre-training, we introduce fuzzy pinyin and glyph information for Chinese characters and adjust the masking strategy so that, with certain probabilities, the pinyin or glyph information of a “[MASK]” token is restored. Since the character information of a typo is itself a kind of noise in CSC, when applying our model we can mask out the character channel of the typo and provide only its pinyin or glyph information, thereby reducing input noise. Moreover, we apply synonym replacement and sentence reordering for paraphrasing to improve the accuracy of the correction step. We conduct experiments on widely accepted benchmarks: our method outperforms state-of-the-art approaches in the zero-shot setting and achieves competitive results after fine-tuning.
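
To make the adjusted masking strategy concrete, the sketch below shows one way a pre-training example could be masked so that the character channel is hidden while the pinyin or glyph channel stays visible. It is a minimal illustration under stated assumptions: the selection probabilities, channel layout, and function names are placeholders, not SCBERT's actual implementation.

    import random

    MASK = "[MASK]"

    def mask_example(chars, pinyins, glyphs, mask_prob=0.15, rng=random.random):
        """Hypothetical masking step: for each selected position, hide the
        character in the char channel but, with some probability, keep its
        pinyin or glyph channel visible, so the model learns to recover a
        character from sound or shape alone. All probabilities below are
        placeholders, not the paper's values."""
        out_c, out_p, out_g, labels = [], [], [], []
        for ch, py, gl in zip(chars, pinyins, glyphs):
            if rng() < mask_prob:
                labels.append(ch)                  # target the model must restore
                r = rng()
                if r < 0.4:                        # keep pinyin only (assumed prob.)
                    out_c.append(MASK); out_p.append(py); out_g.append(MASK)
                elif r < 0.8:                      # keep glyph only (assumed prob.)
                    out_c.append(MASK); out_p.append(MASK); out_g.append(gl)
                else:                              # mask every channel, as in vanilla BERT
                    out_c.append(MASK); out_p.append(MASK); out_g.append(MASK)
            else:
                labels.append(None)                # position not used for prediction
                out_c.append(ch); out_p.append(py); out_g.append(gl)
        return out_c, out_p, out_g, labels

At correction time the analogous step would be to mask only the character channel of a suspected typo while keeping its pinyin or glyph input, as described in the abstract.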

Notes

  1. https://github.com/mozillazg/python-pinyin.
  2. https://github.com/howl-anderson/hanzi_chaizi.
  3. https://github.com/fxsjy/jieba.
  4. https://github.com/brightmart/roberta_zh.
  5. https://dumps.wikimedia.org/zhwiki/.
  6. http://nlp.ee.ncu.edu.tw/resource/ncu_nlplab_csc.zip.
  7. https://pypi.org/project/OpenCC/.
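
As a rough sketch of how the footnoted tools (notes 1 and 2) could supply the pinyin and glyph inputs, the snippet below uses python-pinyin and hanzi_chaizi roughly as documented in their READMEs. The paper's actual preprocessing is not shown in this preview, so treat the calls and the fallback behaviour as assumptions.

    from pypinyin import lazy_pinyin       # note 1: per-character pinyin
    from hanzi_chaizi import HanziChaizi   # note 2: glyph (component) decomposition

    hc = HanziChaizi()

    def char_features(sentence):
        """Return (character, pinyin, components) triples for a Chinese sentence.
        Assumed hanzi_chaizi usage: hc.query('好') -> ['女', '子']."""
        pinyins = lazy_pinyin(sentence)    # one syllable per Chinese character
        feats = []
        for ch, py in zip(sentence, pinyins):
            parts = hc.query(ch)           # may return None for unknown characters
            feats.append((ch, py, parts if parts else [ch]))
        return feats

    print(char_features("拼写纠错"))

Jieba (note 3) is a Chinese word segmenter and OpenCC (note 7) converts between traditional and simplified Chinese; how they fit into the pipeline is not specified in this preview.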

Author information

Correspondence to Hong Gao.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Gao, H., Tu, X., Guan, D. (2023). SCBERT: Single Channel BERT for Chinese Spelling Correction. In: Li, B., Yue, L., Tao, C., Han, X., Calvanese, D., Amagasa, T. (eds) Web and Big Data. APWeb-WAIM 2022. Lecture Notes in Computer Science, vol 13422. Springer, Cham. https://doi.org/10.1007/978-3-031-25198-6_30

  • DOI: https://doi.org/10.1007/978-3-031-25198-6_30

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-25197-9

  • Online ISBN: 978-3-031-25198-6

  • eBook Packages: Computer Science, Computer Science (R0)
