Abstract
Because parallel data for the grammatical error correction (GEC) task is scarce, models based on the sequence-to-sequence framework cannot be trained adequately and their performance suffers. We propose two data synthesis methods that can control the error rate and the ratio of error types in the synthetic data. The first approach corrupts each word in a monolingual corpus with a fixed probability, using replacement, insertion, and deletion. The second approach trains error generation models and then filters their decoding results. Experiments on different synthetic data show that an error rate of 40% combined with an equal ratio of error types improves model performance the most. Finally, we synthesize about 100 million synthetic examples and achieve performance comparable to the state of the art, which uses twice as much data as we do.
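As a rough illustration of the first approach, word-level corruption with a controllable error rate and error-type ratio might be sketched as follows. This is a minimal sketch under stated assumptions, not the procedure used in the paper: the function name `corrupt`, the `type_weights` parameter, and the tiny replacement vocabulary are hypothetical, and the abstract does not say how replacement or inserted words are actually chosen. The defaults mirror the setting the abstract reports as most effective (a 40% error rate with an equal ratio of error types).

```python
import random

# Hypothetical stand-in vocabulary; the source does not specify how
# replacement or inserted words are selected.
DEFAULT_VOCAB = ["the", "a", "of", "to", "in", "on", "is", "are"]


def corrupt(tokens, error_rate=0.4, type_weights=(1, 1, 1), vocab=None, rng=None):
    """Corrupt each token independently with probability `error_rate`.

    `type_weights` gives the relative ratio of (replace, insert, delete)
    among the corrupted tokens; equal weights mean an equal ratio.
    """
    rng = rng or random.Random()
    vocab = vocab or DEFAULT_VOCAB
    ops = ("replace", "insert", "delete")
    noisy = []
    for tok in tokens:
        # Leave the token untouched with probability 1 - error_rate.
        if rng.random() >= error_rate:
            noisy.append(tok)
            continue
        op = rng.choices(ops, weights=type_weights, k=1)[0]
        if op == "replace":
            # Substitute the token with a word drawn from the vocabulary.
            noisy.append(rng.choice(vocab))
        elif op == "insert":
            # Keep the token and add a spurious word after it.
            noisy.append(tok)
            noisy.append(rng.choice(vocab))
        # "delete": append nothing, so the token is dropped.
    return noisy


if __name__ == "__main__":
    clean = "she goes to school every day".split()
    noisy = corrupt(clean, rng=random.Random(0))
    # The (noisy, clean) pair would serve as one synthetic training example.
    print(" ".join(noisy), "->", " ".join(clean))
```

Changing `error_rate` or `type_weights` varies the two quantities the abstract says the methods control, which is how different synthetic data sets could be compared.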
Acknowledgements
This work was supported by the funds of Beijing Advanced Innovation Center for Language Resources (TYZ19005) and Research Program of State Language Commission (ZDI135-105, YB135-89).
Author information
Liner Yang received his PhD degree in computer science from Tsinghua University, China in 2018. He is currently a lecturer at the School of Information Sciences, Beijing Language and Culture University, China. His research interests include natural language processing and intelligent computer-assisted language learning.
Chengcheng Wang received his BS degree in computer science and technology from Beijing University of Technology, China in 2017, where he is currently pursuing his MS degree in computer science and technology. His research interests include natural language processing and grammatical error correction.
Yun Chen received her BS degree in microelectronics from Tsinghua University, China in 2013 and her PhD degree in electrical and electronic engineering from the University of Hong Kong, China in 2018. She is broadly interested in machine learning and natural language processing, especially neural machine translation and pre-trained language models.
Yongping Du received her PhD degree in computer science from Fudan University, China in 2005. She is currently a professor at Beijing University of Technology, China. Her research interests include information retrieval, information extraction, and natural language processing.
Erhong Yang received her MS degree in computer science from Shanxi University, China in 1989, and her PhD degree in linguistics from Beijing Language and Culture University, China in 2005. She is the executive deputy director of the Beijing Advanced Innovation Center for Language Resources, Beijing Language and Culture University, China. Her research interests include language resources and computational linguistics.
Cite this article
Yang, L., Wang, C., Chen, Y. et al. Controllable data synthesis method for grammatical error correction. Front. Comput. Sci. 16, 164318 (2022). https://doi.org/10.1007/s11704-020-0286-4
DOI: https://doi.org/10.1007/s11704-020-0286-4