Abstract
Chinese Spelling Correction (CSC) is a challenging task that requires both modeling the language and capturing the implicit patterns by which spelling errors arise. In this paper, we propose PGBERT, a Phonology and Glyph Enhanced Pre-training method for CSC. For phonology, PGBERT uses a Bi-GRU to encode the Pinyin sequence of each Chinese character into a phonology embedding. For glyphs, we introduce the Ideographic Description Sequence (IDS) to decompose each Chinese character into a binary tree of basic strokes, and then use an encoder based on gated units to encode the glyph tree recursively. At each layer of the original model, PGBERT adds separate channels for the phonology and glyph encodings, then applies a multi-channel fusion function and a residual connection to produce an output for each channel. Empirical analysis shows that PGBERT is an effective method for CSC and achieves state-of-the-art performance on widely used benchmarks.
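The glyph channel described above can be illustrated with a minimal sketch. It is not the authors' encoder: the function names (`parse_ids`, `embed`, `encode`), the toy dimension `DIM`, the seeded random embeddings, and the weight-free gate are all assumptions made for illustration, and the leaves here are character components rather than basic strokes. The sketch only shows how a prefix-notation IDS string maps to a binary tree and how a gated unit can combine child representations bottom-up.

```python
import math
import random

# Binary Ideographic Description Characters: U+2FF0..U+2FFB without the
# ternary operators U+2FF2 and U+2FF3, which this sketch does not handle.
BINARY_IDC = set("⿰⿱⿴⿵⿶⿷⿸⿹⿺⿻")
DIM = 4  # toy embedding size (assumption, not from the paper)


def parse_ids(ids):
    """Parse a prefix-notation IDS string into a nested (op, left, right) tree."""
    def walk(pos):
        ch = ids[pos]
        if ch in BINARY_IDC:
            left, pos = walk(pos + 1)
            right, pos = walk(pos)
            return (ch, left, right), pos
        return ch, pos + 1  # leaf component

    tree, end = walk(0)
    if end != len(ids):
        raise ValueError("trailing symbols in IDS string")
    return tree


def embed(symbol):
    """Deterministic toy embedding seeded by the symbol's code point.

    A trained model would look up learned embeddings instead.
    """
    rng = random.Random(ord(symbol))
    return [rng.uniform(-1.0, 1.0) for _ in range(DIM)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def encode(tree):
    """Recursively encode the tree bottom-up with an element-wise gate."""
    if isinstance(tree, str):
        return embed(tree)
    op, left, right = tree
    h_l, h_r, h_op = encode(left), encode(right), embed(op)
    # Hypothetical gate without learned weights: it depends on the layout
    # operator and both children, and mixes the children into the parent.
    gate = [sigmoid(o + l - r) for o, l, r in zip(h_op, h_l, h_r)]
    return [g * l + (1.0 - g) * r for g, l, r in zip(gate, h_l, h_r)]


tree = parse_ids("⿱⿰木目心")  # 想: 相 (木 beside 目) above 心
glyph_vector = encode(tree)
```

Because each parent is a gated mix of its children, the same routine handles trees of any depth; a trained version would replace `embed` and the gate with learned parameters.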
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Bao, L., Chen, X., Ren, J., Liu, Y., Qi, C. (2022). PGBERT: Phonology and Glyph Enhanced Pre-training for Chinese Spelling Correction. In: Lu, W., Huang, S., Hong, Y., Zhou, X. (eds) Natural Language Processing and Chinese Computing. NLPCC 2022. Lecture Notes in Computer Science, vol 13551. Springer, Cham. https://doi.org/10.1007/978-3-031-17120-8_2
DOI: https://doi.org/10.1007/978-3-031-17120-8_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-17119-2
Online ISBN: 978-3-031-17120-8
eBook Packages: Computer Science, Computer Science (R0)