
Chinese Text Error Correction Based on PE-T5 Model

Published: 29 July 2024

Abstract

Chinese text error correction aims to automatically detect errors or inappropriate expressions in a given Chinese text and to output a corrected result appropriate to the context and application scenario. Existing approaches fall mainly into two categories, rule-based and deep-model-based; the deep-model-based methods concentrate on alignment-based errors and perform poorly on non-alignment errors. Moreover, because manually labeled data are scarce, few works have considered pre-training the model with self-supervised methods. Therefore, targeting the redundant-word and omitted-word errors common in electric power document writing, this paper proposes a generative-model-based error correction method for Chinese text. A large-scale error correction dataset is first constructed using a rule-based self-supervised approach, and the model is pre-trained on a general-domain dataset. Then, in the incremental training phase, lexical supervision signals are added inside the model so that lexical combinations strengthen error detection. Finally, the model is fine-tuned on a domain-specific self-supervised dataset. Experimental comparison with other models demonstrates the effectiveness of the proposed method.
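The abstract's rule-based self-supervised data construction can be pictured as corrupting clean sentences into (noisy, clean) training pairs. The paper's exact corruption rules are not given on this page; the sketch below is a minimal illustration under the assumption that the two target error types are produced by randomly duplicating a character (redundant-word error) or dropping a character (omitted-word error). The function name `make_pair` and the probability parameters are hypothetical.

```python
import random

def make_pair(sentence, p_dup=0.1, p_del=0.1, rng=None):
    """Synthesize a (noisy, clean) pair for self-supervised training.

    Each character is independently dropped with probability p_del
    (simulating an omitted-word error) or duplicated with probability
    p_dup (simulating a redundant-word error).
    """
    rng = rng or random.Random()
    noisy = []
    for ch in sentence:
        r = rng.random()
        if r < p_del:
            continue          # omitted-word error: drop the character
        noisy.append(ch)
        if r < p_del + p_dup:
            noisy.append(ch)  # redundant-word error: duplicate the character
    return "".join(noisy), sentence

# Example: corrupt a clean domain sentence into a training pair.
rng = random.Random(42)
noisy, clean = make_pair("电力系统安全运行", rng=rng)
```

A generative model such as T5 would then be trained to map `noisy` back to `clean`, which handles both error types without requiring token-level alignment between input and output.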



Published In

cover image ACM Other conferences
CNIOT '24: Proceedings of the 2024 5th International Conference on Computing, Networks and Internet of Things
May 2024
668 pages
ISBN:9798400716751
DOI:10.1145/3670105
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  • Deep learning
  • Text error correction
  • Self-supervision
  • Natural language processing
  • T5

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • State Grid Shandong Electric Power Company Technology Project

Conference

CNIOT 2024

Acceptance Rates

Overall Acceptance Rate 39 of 82 submissions, 48%
