PGBERT: Phonology and Glyph Enhanced Pre-training for Chinese Spelling Correction

  • Conference paper
Natural Language Processing and Chinese Computing (NLPCC 2022)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13551)

Abstract

Chinese Spelling Correction (CSC) is a challenging task that requires both modeling the language and capturing the implicit patterns by which spelling errors arise. In this paper, we propose PGBERT, a Phonology and Glyph Enhanced Pre-training method for CSC. For phonology, PGBERT uses a Bi-GRU to encode each Chinese character’s Pinyin sequence as a phonology embedding. For glyphs, we use the Ideographic Description Sequence (IDS) to decompose each Chinese character into a binary tree of basic strokes, and an encoder based on gated units then encodes the glyph tree recursively. At each layer of the original model, PGBERT adds extra channels for phonology and glyph encoding, then applies a multi-channel fusion function and a residual connection to yield an output for each channel. Empirical analysis shows that PGBERT is an effective method for CSC and achieves state-of-the-art performance on widely used benchmarks.
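
The glyph side of the abstract can be illustrated with a small sketch. It is not the authors' implementation: it only shows, under stated assumptions, how a prefix-notation IDS string might be parsed into a binary tree and then folded bottom-up, which is the shape of computation a recursive gated encoder would follow. The names `Node`, `parse_ids`, and `fold_tree` are hypothetical, the binarization of the three-operand operators is an assumption, and the leaves in the example are character components rather than the basic strokes the abstract mentions.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple, TypeVar

T = TypeVar("T")

# Unicode Ideographic Description Characters occupy U+2FF0..U+2FFB.
# Two of them take three operands; mapping them onto nested two-operand
# operators is an assumption made here so that the tree stays binary.
TERNARY_TO_BINARY = {"\u2ff2": "\u2ff0", "\u2ff3": "\u2ff1"}
BINARY_IDC = {chr(c) for c in range(0x2FF0, 0x2FFC)} - set(TERNARY_TO_BINARY)


@dataclass
class Node:
    """A glyph-tree node: an operator with two children, or a leaf component."""
    label: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def _parse(ids: str, i: int) -> Tuple[Node, int]:
    """Parse one subtree starting at index i; return it and the next index."""
    if i >= len(ids):
        raise ValueError("truncated IDS string")
    ch = ids[i]
    if ch in BINARY_IDC:
        left, i = _parse(ids, i + 1)
        right, i = _parse(ids, i)
        return Node(ch, left, right), i
    if ch in TERNARY_TO_BINARY:
        op = TERNARY_TO_BINARY[ch]
        first, i = _parse(ids, i + 1)
        second, i = _parse(ids, i)
        third, i = _parse(ids, i)
        # Right-nest the last two operands to obtain a binary tree.
        return Node(op, first, Node(op, second, third)), i
    return Node(ch), i + 1


def parse_ids(ids: str) -> Node:
    """Parse a complete prefix-notation IDS string into a binary tree."""
    root, end = _parse(ids, 0)
    if end != len(ids):
        raise ValueError("trailing characters after IDS string")
    return root


def fold_tree(node: Node,
              leaf_fn: Callable[[str], T],
              merge_fn: Callable[[str, T, T], T]) -> T:
    """Combine the tree bottom-up. A learned gated unit would play the role
    of merge_fn in a neural encoder; here it is any plain Python function."""
    if node.left is None or node.right is None:
        return leaf_fn(node.label)
    return merge_fn(node.label,
                    fold_tree(node.left, leaf_fn, merge_fn),
                    fold_tree(node.right, leaf_fn, merge_fn))


# Illustrative input: a top-bottom structure whose upper part is left-right.
tree = parse_ids("\u2ff1\u2ff0木目心")
leaves = fold_tree(tree, lambda c: [c], lambda op, l, r: l + r)
depth = fold_tree(tree, lambda c: 0, lambda op, l, r: 1 + max(l, r))
print(leaves, depth)
```

Swapping `merge_fn` for a function that mixes two child vectors through a gate would turn the same traversal into a recursive encoder; the traversal order itself does not change.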

Notes

  1. https://github.com/cjkvi/cjkvi-ids

  2. https://github.com/suzhoushr/nlp_chinese_corpus

Author information

Corresponding author

Correspondence to Lujia Bao.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Bao, L., Chen, X., Ren, J., Liu, Y., Qi, C. (2022). PGBERT: Phonology and Glyph Enhanced Pre-training for Chinese Spelling Correction. In: Lu, W., Huang, S., Hong, Y., Zhou, X. (eds) Natural Language Processing and Chinese Computing. NLPCC 2022. Lecture Notes in Computer Science, vol 13551. Springer, Cham. https://doi.org/10.1007/978-3-031-17120-8_2

  • DOI: https://doi.org/10.1007/978-3-031-17120-8_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-17119-2

  • Online ISBN: 978-3-031-17120-8

  • eBook Packages: Computer Science, Computer Science (R0)
