DOI: 10.1145/3404835.3463050
short-paper

DCSpell: A Detector-Corrector Framework for Chinese Spelling Error Correction

Published: 11 July 2021

Abstract

Spelling Error Correction (SEC), which detects and corrects spelling errors in text, has a wide range of applications in human language understanding. Earlier solutions, including statistical methods and one-stage and two-stage machine-learning methods, cannot build deeply bidirectional models, which significantly limits their learning ability. With the recent emergence of masked language models, transformer-based networks have achieved remarkable success in SEC. However, current transformer-based Chinese SEC algorithms are all end-to-end methods, which suffer from high false alarm rates because they correct every character of the sentence regardless of whether it is correct. This issue becomes even more severe when only a small fraction of the characters in the sentence are incorrect. To solve this problem, we propose a cloze-style detector-corrector framework (DCSpell) that first detects whether a character is erroneous before correcting it. Specifically, DCSpell employs the discriminator of ELECTRA as the Detector to locate incorrect characters. The Detector is trained with the sample-efficient replaced token detection pre-training task, and thus allows domain adaptation with a small amount of data. A transformer-based Corrector then finds the correct character for each detected position. It takes sentence pairs as input, which potentially incorporates knowledge of phonological and visual similarity. Confusion-set-based post-processing further improves performance. Experiments show that DCSpell achieves a 15.7% improvement on the SIGHAN dataset and a 6.6% improvement on a dataset transcribed from a real-world acoustic speech corpus over state-of-the-art methods in terms of F1 score.
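
The detect-then-correct control flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the trained ELECTRA discriminator and the transformer Corrector are replaced by hypothetical toy functions, the sentence pair is assumed to be the original sentence together with its masked copy, and the post-processing rule (accept the highest-ranked candidate that appears in the confusion set of the original character, otherwise keep the original) is likewise an assumption. All names below are illustrative.

```python
from typing import Callable, Dict, List, Set

MASK = "[MASK]"

# Hypothetical stand-ins for the trained models (assumptions, not the paper's code).
Detector = Callable[[str], List[int]]                   # sentence -> flagged positions
Corrector = Callable[[str, List[str], int], List[str]]  # (original, masked, pos) -> ranked candidates


def mask_positions(sentence: str, positions: List[int]) -> List[str]:
    """Cloze-style input: replace only the flagged characters with a mask token."""
    flagged = set(positions)
    return [MASK if i in flagged else ch for i, ch in enumerate(sentence)]


def detect_then_correct(
    sentence: str,
    detector: Detector,
    corrector: Corrector,
    confusion_set: Dict[str, Set[str]],
) -> str:
    """Correct only the positions the detector flags; all other characters pass through."""
    positions = detector(sentence)
    masked = mask_positions(sentence, positions)
    out = list(sentence)
    for pos in positions:
        allowed = confusion_set.get(sentence[pos], set())
        # Assumed post-processing: take the best-ranked candidate that is
        # phonologically or visually confusable with the original character.
        for candidate in corrector(sentence, masked, pos):
            if candidate in allowed:
                out[pos] = candidate
                break
    return "".join(out)


def toy_detector(sentence: str) -> List[int]:
    """Toy rule: flag the first character of the misspelled bigram."""
    return [i for i in range(len(sentence)) if sentence[i:i + 2] == "平果"]


def toy_corrector(original: str, masked: List[str], pos: int) -> List[str]:
    """Toy ranking; the first candidate is deliberately outside the confusion set."""
    return ["水", "苹"]


TOY_CONFUSION_SET = {"平": {"苹", "评", "瓶"}}

print(detect_then_correct("我喜欢吃平果", toy_detector, toy_corrector, TOY_CONFUSION_SET))
```

Because unflagged characters are never passed to the corrector, a sentence with no detected errors is returned unchanged, which is the mechanism the abstract credits for lowering the false alarm rate.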

Supplementary Material

MP4 File (video_ppt.mp4)
Spelling Error Correction (SEC) has a wide range of applications. Current Chinese SEC algorithms are all end-to-end methods, which suffer from high false alarm rates. We propose a detector-corrector framework (DCSpell). DCSpell employs the discriminator of ELECTRA as the Detector to locate incorrect characters. The Detector uses ELECTRA-like pre-training, which allows domain adaptation with a small amount of data. A transformer-based Corrector then finds the correct character for each detected position. It takes sentence pairs as input, which potentially incorporates knowledge of phonological and visual similarity. Confusion-set-based post-processing further improves performance. Experiments show that DCSpell achieves a 15.7% improvement on the SIGHAN dataset and a 6.6% improvement on a dataset transcribed from a real-world acoustic speech corpus over state-of-the-art methods in terms of F1 score.

      Published In

      SIGIR '21: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval
      July 2021
      2998 pages
      ISBN:9781450380379
      DOI:10.1145/3404835

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. attention models
      2. language models
      3. query rewrite
      4. spelling correction

      Acceptance Rates

      Overall Acceptance Rate 792 of 3,983 submissions, 20%

      Cited By

      • (2025) A transformer-based spelling error correction framework for Bangla and resource scarce Indic languages. Computer Speech & Language 89 (101703). DOI: 10.1016/j.csl.2024.101703. Online publication date: Jan-2025
      • (2025) Efficient word segmentation for enhancing Chinese spelling check in pre-trained language model. Knowledge and Information Systems 67:1 (603-632). DOI: 10.1007/s10115-024-02230-3. Online publication date: 1-Jan-2025
      • (2024) MISpeller: Multimodal Information Enhancement for Chinese Spelling Correction. IEICE Transactions on Information and Systems E107.D:10 (1342-1352). DOI: 10.1587/transinf.2023EDP7269. Online publication date: 1-Oct-2024
      • (2024) DBCERT: Reconstruct the BERT Model for Chinese Spelling Correction. 2024 IEEE International Symposium on Product Compliance Engineering - Asia (ISPCE-ASIA) (1-6). DOI: 10.1109/ISPCE-ASIA64773.2024.10756270. Online publication date: 25-Oct-2024
      • (2024) A Lightweight Chinese Multimodal Textual Defense Method based on Contrastive-Adversarial Training. 2024 International Joint Conference on Neural Networks (IJCNN) (1-10). DOI: 10.1109/IJCNN60899.2024.10649990. Online publication date: 30-Jun-2024
      • (2024) Bridging the Gap: A Self-Learning Model Using Implicit Knowledge for Chinese Spelling Correction. ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (12286-12290). DOI: 10.1109/ICASSP48485.2024.10448357. Online publication date: 14-Apr-2024
      • (2024) Neural Spell-Checker: Beyond Words with Synthetic Data Generation. Text, Speech, and Dialogue (85-96). DOI: 10.1007/978-3-031-70563-2_7. Online publication date: 9-Sep-2024
      • (2023) An Efficient and Robust Semantic Hashing Framework for Similar Text Search. ACM Transactions on Information Systems 41:4 (1-31). DOI: 10.1145/3570725. Online publication date: 22-Mar-2023
      • (2023) CCCSpell: A Consistent and Contrastive Learning Approach with Character Similarity for Chinese Spelling Check. 2023 International Joint Conference on Neural Networks (IJCNN) (1-8). DOI: 10.1109/IJCNN54540.2023.10191137. Online publication date: 18-Jun-2023
      • (2022) Chinese Spelling Error Correction Based on Adaptive Pre-training and Multi-Task Learning. 2022 4th International Conference on Applied Machine Learning (ICAML) (124-128). DOI: 10.1109/ICAML57167.2022.00030. Online publication date: Jul-2022
