
Chinese Text Error Correction Based on PE-T5 Model

Published: 29 July 2024

Abstract

Chinese text error correction aims to automatically detect errors or inappropriate expressions in a given Chinese text and to output a corrected result appropriate to the context and application scenario. Existing approaches fall mainly into two categories, rule-based and deep-model-based; the deep-model-based methods concentrate on alignment-based errors and perform poorly on non-alignment errors. Moreover, because manually labeled data are scarce, few works have considered pre-training the model with self-supervised methods. Therefore, targeting the redundant-word and omitted-word errors common in electric power document writing, this paper proposes a generative-model-based error correction method for Chinese text. A large-scale error correction dataset is first constructed using a rule-based self-supervised approach, and the model is pre-trained on a general-domain dataset. Then, in the incremental training phase, lexical supervision signals are added inside the model so that lexical combinations strengthen error detection. Finally, the model is fine-tuned on a domain-specific self-supervised dataset. Experimental comparison with other models demonstrates the effectiveness of the proposed method.
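The abstract's rule-based self-supervised data construction can be pictured as corrupting clean sentences into (noisy, clean) training pairs. The paper's exact corruption rules are not given on this page; the sketch below is a minimal illustration under the assumption that the two target error types are produced by randomly duplicating a character (redundant-word error) or dropping a character (omitted-word error). The function name `make_pair` and the probability parameters are hypothetical.

```python
import random

def make_pair(sentence, p_dup=0.1, p_del=0.1, rng=None):
    """Synthesize a (noisy, clean) pair for self-supervised training.

    Each character is independently dropped with probability p_del
    (simulating an omitted-word error) or duplicated with probability
    p_dup (simulating a redundant-word error).
    """
    rng = rng or random.Random()
    noisy = []
    for ch in sentence:
        r = rng.random()
        if r < p_del:
            continue          # omitted-word error: drop the character
        noisy.append(ch)
        if r < p_del + p_dup:
            noisy.append(ch)  # redundant-word error: duplicate the character
    return "".join(noisy), sentence

# Example: corrupt a clean domain sentence into a training pair.
rng = random.Random(42)
noisy, clean = make_pair("电力系统安全运行", rng=rng)
```

A generative model such as T5 would then be trained to map `noisy` back to `clean`, which handles both error types without requiring token-level alignment between input and output.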



Published In

cover image ACM Other conferences
CNIOT '24: Proceedings of the 2024 5th International Conference on Computing, Networks and Internet of Things
May 2024
668 pages
ISBN:9798400716751
DOI:10.1145/3670105
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  • Deep learning
  • Text error correction
  • Self-supervision
  • Natural language processing
  • T5

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • State Grid Shandong Electric Power Company Technology Project

Conference

CNIOT 2024

Acceptance Rates

Overall Acceptance Rate 39 of 82 submissions, 48%
