PGBERT: Phonology and Glyph Enhanced Pre-training for Chinese Spelling Correction

Bao, Lujia; Chen, XiaoShuai; Ren, Junwen; Liu, Yujia; Qi, Chao

doi:10.1007/978-3-031-17120-8_2

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13551)

Included in the following conference series:

CCF International Conference on Natural Language Processing and Chinese Computing

Abstract

Chinese Spelling Correction (CSC) is a challenging task that requires both modeling the language and capturing the implicit patterns by which spelling errors are generated. In this paper, we propose PGBERT, a Phonology and Glyph Enhanced Pre-training method for CSC. For phonology, PGBERT uses a Bi-GRU to encode the Pinyin sequence of each Chinese character into a phonology embedding. For glyphs, we introduce the Ideographic Description Sequence (IDS) to decompose each Chinese character into a binary tree of basic strokes, and an encoder based on gated units then encodes this glyph tree recursively. At each layer of the original model, PGBERT adds separate channels for phonology and glyph encoding, then applies a multi-channel fusion function and a residual connection to yield an output for each channel. Empirical analysis shows that PGBERT is a powerful method for CSC and achieves state-of-the-art performance on widely used benchmarks.
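
To make the glyph channel concrete, the following is a minimal sketch of parsing an IDS string into a binary tree and encoding it recursively with a gate. It is an illustration under stated assumptions, not the authors' encoder: the names `GlyphNode`, `parse_ids`, `toy_embedding`, and `encode` are hypothetical, the embeddings are deterministic stand-ins for learned parameters, the gate formula is one simple choice among many, and the leaves are character components rather than the basic strokes the abstract describes.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Unicode ideographic description characters that take exactly two operands
# (U+2FF0, U+2FF1, U+2FF4..U+2FFB); the ternary ones are left out on purpose.
BINARY_IDCS = {chr(c) for c in (0x2FF0, 0x2FF1, *range(0x2FF4, 0x2FFC))}
DIM = 4  # toy embedding size (assumption)


@dataclass
class GlyphNode:
    symbol: str
    left: Optional["GlyphNode"] = None
    right: Optional["GlyphNode"] = None


def parse_ids(ids: str) -> GlyphNode:
    """Parse a prefix-notation IDS string into a binary tree."""
    def parse(pos: int) -> Tuple[GlyphNode, int]:
        symbol = ids[pos]
        if symbol in BINARY_IDCS:
            left, pos = parse(pos + 1)
            right, pos = parse(pos)
            return GlyphNode(symbol, left, right), pos
        return GlyphNode(symbol), pos + 1

    root, end = parse(0)
    if end != len(ids):
        raise ValueError("IDS string has symbols left over after the root")
    return root


def toy_embedding(symbol: str) -> List[float]:
    # Hypothetical stand-in for a learned embedding table.
    code = ord(symbol)
    return [math.sin(code * (i + 1)) for i in range(DIM)]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def encode(node: GlyphNode) -> List[float]:
    """Encode a glyph tree bottom-up with a per-dimension gate."""
    if node.left is None or node.right is None:
        return toy_embedding(node.symbol)
    left, right = encode(node.left), encode(node.right)
    op = toy_embedding(node.symbol)
    # Assumed gate: the layout operator and both children decide how much
    # of each child flows into the parent representation.
    gate = [sigmoid(o + l - r) for o, l, r in zip(op, left, right)]
    return [g * l + (1.0 - g) * r for g, l, r in zip(gate, left, right)]


# One IDS for the character 想: top-bottom(left-right(木, 目), 心).
tree = parse_ids("\u2ff1\u2ff0木目心")
vector = encode(tree)
print(tree.symbol, tree.left.symbol, tree.right.symbol, len(vector))
```

Because each internal node mixes its two children through a gate, the encoder handles trees of any depth with the same parameters, which is the property that makes a recursive encoder a natural fit for IDS trees.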

Author information

Authors and Affiliations

Beijing University of Posts and Telecommunications, Beijing, China
Lujia Bao
Tencent, Beijing, China
XiaoShuai Chen, Yujia Liu & Chao Qi
Beijing Institute of Technology, Beijing, China
Junwen Ren

Corresponding author

Correspondence to Lujia Bao.

Editor information

Editors and Affiliations

Singapore University of Technology and Design, Singapore, Singapore
Wei Lu
Nanjing University, Nanjing, China
Shujian Huang
Soochow University, Suzhou, China
Yu Hong
Soochow University, Soochow, China
Xiabing Zhou

About this paper

Cite this paper

Bao, L., Chen, X., Ren, J., Liu, Y., Qi, C. (2022). PGBERT: Phonology and Glyph Enhanced Pre-training for Chinese Spelling Correction. In: Lu, W., Huang, S., Hong, Y., Zhou, X. (eds) Natural Language Processing and Chinese Computing. NLPCC 2022. Lecture Notes in Computer Science, vol 13551. Springer, Cham. https://doi.org/10.1007/978-3-031-17120-8_2

DOI: https://doi.org/10.1007/978-3-031-17120-8_2
Published: 24 September 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-17119-2
Online ISBN: 978-3-031-17120-8
eBook Packages: Computer Science, Computer Science (R0)
