Correcting Chinese Spelling Errors with Word Lattice Decoding

Published: 11 November 2015

Abstract

Chinese spell checkers are difficult to develop because of two features of the language: 1) text has no explicit word boundaries, and a character may function as either a word or a morpheme within a word; and 2) the Chinese character set contains more than ten thousand characters. The former makes it difficult for a spell checker to detect spelling errors, and the latter makes it difficult to construct error models. We develop a word lattice decoding model for a Chinese spell checker that addresses both difficulties. The model performs word segmentation and error correction simultaneously, thereby solving the word boundary problem, and it corrects nonword errors as well as real-word errors. To better estimate the error distribution over a large character set, we also propose a methodology for automatically extracting spelling error samples from the Google Web 1T corpus. Because the corpus is so large, many spelling error samples can be extracted, and these better reflect real-world spelling error distributions. Finally, to make the spell checker more useful in real applications, we produce n-best suggestions for spelling error corrections. We test the proposed approach on the Bakeoff 2013 CSC datasets; the results show that the proposed methods with the error model significantly outperform Chinese spell checkers that do not use error models.
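
The idea of segmenting and correcting in a single pass can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a unigram word model, a single fixed error probability per confusable character, and a tiny hand-made lexicon and confusion set, all of which are hypothetical values chosen only for illustration. Each lattice edge is a lexicon word that could have produced a span of the input, scored in noisy-channel fashion as log P(word) + log P(observed span | word), and a Viterbi search picks the best path.

```python
import math

# Hypothetical toy lexicon: word -> unigram probability (made-up values).
LEXICON = {
    "今天": 0.02, "天氣": 0.01, "很": 0.03, "好": 0.03,
    "今": 0.001, "天": 0.005, "汽": 0.0005, "氣": 0.002,
}
# Hypothetical confusion sets: intended character -> characters it may be miswritten as.
CONFUSION = {"氣": {"汽"}, "汽": {"氣"}}
P_ERROR = 0.01    # assumed probability of writing one particular confusable character
P_CORRECT = 0.98  # assumed probability of writing the intended character


def channel_logprob(observed, intended):
    """Return log P(observed | intended) for equal-length strings, or None if impossible."""
    total = 0.0
    for obs_char, int_char in zip(observed, intended):
        if obs_char == int_char:
            total += math.log(P_CORRECT)
        elif obs_char in CONFUSION.get(int_char, ()):
            total += math.log(P_ERROR)
        else:
            return None  # no edge: this word cannot explain the observed span
    return total


def decode(sentence, max_len=2):
    """Viterbi search over a word lattice; returns the best word sequence or None."""
    n = len(sentence)
    # best[i] = (score of the best path covering sentence[:i], backpointer)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for i in range(n):
        if best[i][0] == -math.inf:
            continue  # position i is not reachable
        for j in range(i + 1, min(n, i + max_len) + 1):
            span = sentence[i:j]
            for word, prob in LEXICON.items():
                if len(word) != len(span):
                    continue
                channel = channel_logprob(span, word)
                if channel is None:
                    continue
                score = best[i][0] + math.log(prob) + channel
                if score > best[j][0]:
                    best[j] = (score, (i, word))
    # Follow backpointers from the end of the sentence.
    words, pos = [], n
    while pos > 0:
        back = best[pos][1]
        if back is None:
            return None  # no path through the lattice covers the whole sentence
        pos, word = back
        words.append(word)
    return words[::-1]


if __name__ == "__main__":
    print(decode("今天天汽很好"))  # ['今天', '天氣', '很', '好']
```

In this toy example the input contains 汽 where 氣 was intended. Because 汽 is itself in the lexicon, the mistake is a real-word error at the character level, yet the path through the word 天氣 still scores higher than the path that keeps 天 and 汽 as separate words, so the segmentation and the correction fall out of the same search. Keeping the k best paths at each position instead of one would give n-best suggestions.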

    Published In

    ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 14, Issue 4
    Special Issue on Chinese Spell Checking
    October 2015
    92 pages
    ISSN: 2375-4699
    EISSN: 2375-4702
    DOI: 10.1145/2845556

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 11 November 2015
    Accepted: 01 June 2015
    Revised: 01 June 2015
    Received: 01 August 2014
    Published in TALLIP Volume 14, Issue 4

    Author Tags

    1. Chinese spelling error checking
    2. computer-assisted language learning
    3. noisy channel model
    4. unknown word detection
    5. word lattice
    6. word segmentation

    Qualifiers

    • Research-article
    • Research
    • Refereed
