Controllable data synthesis method for grammatical error correction

  • Research Article
Frontiers of Computer Science

Abstract

Due to the lack of parallel data for the grammatical error correction (GEC) task, models based on the sequence-to-sequence framework cannot be trained adequately enough to reach higher performance. We propose two data synthesis methods that can control the error rate and the ratio of error types in synthetic data. The first approach corrupts each word in a monolingual corpus with a fixed probability, using replacement, insertion, and deletion. The second approach trains error generation models and then filters their decoding results. Experiments on different synthetic data show that an error rate of 40% with the same ratio for each error type improves model performance the most. Finally, we synthesize about 100 million data instances and achieve performance comparable to the state of the art, which uses twice as much data as we do.
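The first approach can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `corrupt`, the `vocab` list used to draw replacement and inserted words, and the choice to insert the new word after the current one are all assumptions made for illustration. The sketch only shows how a per-word corruption probability and a ratio over the three error types could be exposed as parameters.

```python
import random


def corrupt(tokens, vocab, error_rate=0.4, type_ratio=(1, 1, 1), rng=None):
    """Corrupt each token independently with probability `error_rate`.

    `type_ratio` gives relative weights for (replace, insert, delete).
    `vocab` is a hypothetical word list for replacements and insertions.
    """
    rng = rng or random.Random()
    ops = ("replace", "insert", "delete")
    noisy = []
    for tok in tokens:
        # Keep the token unchanged with probability 1 - error_rate.
        if rng.random() >= error_rate:
            noisy.append(tok)
            continue
        # Pick one error type according to the requested ratio.
        op = rng.choices(ops, weights=type_ratio, k=1)[0]
        if op == "replace":
            noisy.append(rng.choice(vocab))
        elif op == "insert":
            # Assumption: the extra word goes after the original token.
            noisy.append(tok)
            noisy.append(rng.choice(vocab))
        # "delete": emit nothing for this token.
    return noisy


if __name__ == "__main__":
    clean = "she goes to school every day".split()
    vocab = ["a", "the", "go", "went", "in", "on", "days"]
    noisy = corrupt(clean, vocab, error_rate=0.4, rng=random.Random(0))
    # The (noisy, clean) pair would serve as one synthetic training example.
    print(" ".join(noisy), "->", " ".join(clean))
```

The defaults mirror the setting the abstract reports as most effective, reading "the same ratio" as equal weights for the three error types; that reading is an interpretation of the abstract, not a detail confirmed by it.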

Acknowledgements

This work was supported by the funds of Beijing Advanced Innovation Center for Language Resources (TYZ19005) and Research Program of State Language Commission (ZDI135-105, YB135-89).

Author information

Corresponding author

Correspondence to Chengcheng Wang.

Additional information

Liner Yang received his PhD degree in computer science from Tsinghua University, China in 2018. He is currently a lecturer at the School of Information Sciences, Beijing Language and Culture University, China. His research interests include natural language processing and intelligent computer-assisted language learning.

Chengcheng Wang received his BS degree in computer science and technology from Beijing University of Technology, China in 2017, where he is currently pursuing his MS degree in computer science and technology. His research interests include natural language processing and grammatical error correction.

Yun Chen received her BS degree in microelectronics from Tsinghua University, China in 2013 and her PhD degree in electrical and electronic engineering from the University of Hong Kong, China in 2018. She is broadly interested in machine learning and natural language processing, especially neural machine translation and pre-trained language models.

Yongping Du received her PhD degree in computer science from Fudan University, China in 2005. She is currently a professor at Beijing University of Technology, China. Her research interests include information retrieval, information extraction, and natural language processing.

Erhong Yang received her MS degree in computer science from Shanxi University, China in 1989, and her PhD degree in linguistics from Beijing Language and Culture University, China in 2005. She is the executive deputy director of the Beijing Advanced Innovation Center for Language Resources, Beijing Language and Culture University, China. Her research interests include language resources and computational linguistics.

About this article

Cite this article

Yang, L., Wang, C., Chen, Y. et al. Controllable data synthesis method for grammatical error correction. Front. Comput. Sci. 16, 164318 (2022). https://doi.org/10.1007/s11704-020-0286-4

  • DOI: https://doi.org/10.1007/s11704-020-0286-4