Self-feeding training method for semi-supervised grammatical error correction

https://doi.org/10.1016/j.csl.2022.101435

Highlights

  • The self-feeding training method can help train grammatical error correction models.

  • Data generation models trained with unlabeled data can generate realistic data.

  • Denoising artificially added noise generates diverse incorrect–correct pairs for training.

  • This method can be extended to domains and languages without labeled data.

Abstract

Grammatical error correction (GEC) has been successful with deep and complex neural machine translation models, but the annotated data needed to train such models are scarce. We propose a novel self-feeding training method that generates incorrect sentences from freely available correct sentences. The proposed training method can generate plausible incorrect sentences from unlabeled sentences using a data generation model trained as an autoencoder. It can also add artificial noise to correct sentences to automatically generate incorrect sentences. We show that GEC models trained with the self-feeding training method are successful without extra annotated data or deeper neural network-based models, achieving an F0.5 score of 0.5982 on the CoNLL-2014 Shared Task test data with a transformer model. The results also show that fully unlabeled training is possible for data-scarce domains and languages.

Introduction

Grammatical error correction (GEC) is the task of correcting possible grammatical errors in a sentence. GEC can be directly used in multiple services such as writing assistants, dialog systems, and language learning. The task has gained interest and has been studied for decades, but it remains challenging.

Neural machine translation (NMT)-based GEC studies have surpassed the previous statistical machine translation (SMT)-based systems. The NMT approach regards the GEC task as translating the erroneous language into a correct one. The NMT approach has been successful using deeper and more complex neural network structures, from recurrent neural networks to convolutional neural networks and transformers (Yuan and Briscoe, 2016, Chollampatt and Ng, 2018, Zhao et al., 2019, Katsumata and Komachi, 2020, Rothe et al., 2021).

NMT-based GEC systems have rapidly become more complex, but few annotated datasets for training such deep models have been published. The supervised training data for GEC are pairs of grammatically incorrect sentences written by language learners (e.g., English-learning students) and correct sentences edited and annotated by language experts (e.g., English teachers). GEC annotated datasets, including the NUS Corpus of Learner English (NUCLE) (Dahlmeier et al., 2013), the Lang-8 data (Mizumoto et al., 2011), the CLC FCE dataset (Yannakoudakis et al., 2011), and the Cambridge English Write & Improve (W&I) and LOCNESS corpus (Bryant et al., 2019), have been released to the public. However, the sizes and variety of these datasets are incomparable to those of other tasks, such as machine translation, and they are too small to train deep and complex NMT models without overfitting, as such annotations are expensive and time-consuming to collect. Some GEC studies use additional closed annotated data to improve their models; however, such results are not reproducible by other research groups.

Compared to annotated data pairs, correct sentences without any annotations are easy to collect from news articles, books, documents, or web pages. Some studies have tried to synthesize parallel data from such sentences. Open synthetic datasets, such as C4200M (Stahlberg and Kumar, 2021), or reproducible methods to build them can help other groups train and research GEC models.

To alleviate the data-scarcity problem of GEC, we propose a self-feeding training method that incorporates semi-supervised learning and paired data generation using unannotated sentences. Our method is able to generate paired training data from unlabeled sentences with a data generation model. Furthermore, our method can select appropriate samples with simple but effective rules, learn to recover wrong words from noisy sentences, and effectively utilize the generated data to train the GEC model. It can yield multiple erroneous sentences from one unlabeled sentence without additional annotated data or a large-scale language model trained on massive data. We experimented with the proposed method using multiple semi-supervised training strategies on transformer-based GEC models. The results show that our proposed method is effective for training previous GEC models without extra labeled data or changes to the existing NMT model structure.
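To make the overall flow concrete, below is a minimal Python sketch of such a self-feeding data-generation loop, assuming hypothetical helpers (train_generation_model, passes_filter) that stand in for the data generation model and selection rules described later; it is not the authors' implementation.

```python
# Minimal sketch of a self-feeding data-generation loop (illustrative only).
# `train_generation_model` and `passes_filter` are hypothetical placeholders
# standing in for the paper's data generation model and its selection rules.

def train_generation_model(unlabeled_sentences):
    """Stand-in for training a data generation model on correct sentences."""
    # Toy 'model' that introduces a single plausible error type.
    return lambda sentence: sentence.replace(" the ", " a ")

def passes_filter(incorrect, correct):
    """Simple selection rules: keep pairs that differ, but not by too much."""
    return incorrect != correct and abs(len(incorrect) - len(correct)) <= 10

def generate_training_pairs(labeled_pairs, unlabeled_sentences):
    generator = train_generation_model(unlabeled_sentences)
    generated = [(generator(y), y) for y in unlabeled_sentences]
    generated = [(x, y) for x, y in generated if passes_filter(x, y)]
    # Generated pairs augment the human-annotated pairs for GEC training.
    return labeled_pairs + generated

labeled = [("He go to school .", "He goes to school .")]
unlabeled = ["She sat on the chair .", "They watched the movie together ."]
print(generate_training_pairs(labeled, unlabeled))
```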

Section snippets

Related studies

Previous GEC models used rule-based and SMT approaches. These approaches performed best in the CoNLL-2014 Shared Task on Grammatical Error Correction (Ng et al., 2014). The winning system used automatically derived rules and an SMT model to generate correction candidates and ranked combinations of candidates with language models (Felice et al., 2014). The second-best model was a combination of multiple individual classifiers to select the best edits within the

Self-feeding training

This section explains our proposed self-feeding training method, which generates labeled data for training the GEC model and thereby addresses an important problem of the GEC task: the limited amount of publicly available training data.
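As a rough illustration of the artificial-noise route mentioned in the abstract, the sketch below corrupts a correct sentence into a synthetic incorrect counterpart; the edit types and probabilities are assumptions made for the example, not the paper's noise configuration.

```python
import random

# Illustrative token-level noise injection; the edit types and rates below are
# assumptions made for this sketch, not the paper's exact configuration.
def add_artificial_noise(tokens, p_drop=0.1, p_swap=0.1, p_repeat=0.05, seed=0):
    rng = random.Random(seed)
    noisy, i = [], 0
    while i < len(tokens):
        r = rng.random()
        if r < p_drop:                                   # delete a token
            i += 1
        elif r < p_drop + p_swap and i + 1 < len(tokens):
            noisy.extend([tokens[i + 1], tokens[i]])     # swap adjacent tokens
            i += 2
        elif r < p_drop + p_swap + p_repeat:             # duplicate a token
            noisy.extend([tokens[i], tokens[i]])
            i += 1
        else:                                            # keep the token
            noisy.append(tokens[i])
            i += 1
    return noisy

correct = "There are many reasons to learn a second language .".split()
incorrect = add_artificial_noise(correct)
print(" ".join(incorrect), "->", " ".join(correct))  # one synthetic training pair
```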

We assume that the annotated dataset of incorrect–correct sentence pairs for supervised training is $D=\{(x_i, y_i)\}_{i=1}^{n}$, and that the size of the dataset $n$ is limited and not sufficient to train complex and deep NMT-based GEC models. However, unannotated correct sentences $\tilde{D}=\{\tilde{y}_j\}_{j=1}^{N}$ without

Semi-supervised training of self-feeding model

The generated data are used together with the given labeled data for the self-feeding training. The more data, the better the model; however, the human-annotated data and the automatically generated data differ in their generation methods, quantities, and qualities. Therefore, semi-supervised training methods that train the model with the two kinds of data are also an important field of study. We experimented with multiple semi-supervised training methods to find which semi-supervised method helps to improve the
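As one concrete example of such a combination, the sketch below uses a pretrain-then-finetune schedule; this is only one plausible strategy, not necessarily the variant the paper finds most effective, and train_epoch is a hypothetical placeholder for one pass of seq2seq training.

```python
# Sketch of a pretrain-then-finetune schedule for combining human-annotated
# and automatically generated pairs; one plausible strategy among several,
# not necessarily the variant the paper finds most effective. `train_epoch`
# is a hypothetical placeholder for one pass of seq2seq training.

def train_epoch(model, pairs):
    """Placeholder: update `model` on a list of (incorrect, correct) pairs."""
    return model

def semi_supervised_training(model, annotated_pairs, generated_pairs,
                             pretrain_epochs=3, finetune_epochs=10):
    # Stage 1: pretrain on the large but noisier generated data.
    for _ in range(pretrain_epochs):
        model = train_epoch(model, generated_pairs)
    # Stage 2: fine-tune on the small, high-quality annotated data.
    for _ in range(finetune_epochs):
        model = train_epoch(model, annotated_pairs)
    return model
```

Other strategies, such as mixing the two sources in every batch or oversampling the annotated pairs, fit the same interface.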

Dataset

We used multiple labeled, unlabeled, and test datasets with different sizes, domains, and qualities. The datasets are introduced briefly below, and their important features are shown in Table 2.

We used four sets of annotated data to train the GEC model: NUCLE (Dahlmeier et al., 2013), CLC FCE (Yannakoudakis et al., 2011), Cambridge English Write & Improve and LOCNESS (W&I+LOCNESS) corpus (Bryant et al., 2019), and Lang-8 data (Mizumoto et al., 2011). NUCLE, CLC FCE, and W&I+LOCNESS sentences are from

Conclusions

We proposed a novel self-feeding training method that incorporates unlabeled data to exploit their full potential for training GEC models. The method proved successful for the GEC task without an additional closed annotated dataset or up-scaling the neural network model. Moreover, the proposed methods are effective even without any annotated data and can be used to train GEC models for domains and languages with little or no training data.

We experimented with rather shallow

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This research was partly supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2021-2020-0-01789) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation) and an IITP grant funded by the Korea government (MSIT) (2019-0-01906, Artificial Intelligence Graduate School Program (POSTECH)).

References (31)

  • Bengio, Y., et al. Advances in optimizing recurrent networks.

  • Bryant, C., Felice, M., Andersen, Ø.E., Briscoe, T., 2019. The BEA-2019 shared task on grammatical error correction....
  • Bryant, C., Felice, M., Briscoe, T., 2017. Automatic Annotation and Evaluation of Error Types for Grammatical Error...
  • Chollampatt, S., Ng, H.T., 2018. A multilayer convolutional encoder-decoder neural network for grammatical error...
  • Dahlmeier, D., Ng, H.T., 2012. Better evaluation for grammatical error correction. In: Proceedings of the 2012...
  • Dahlmeier, D., Ng, H.T., Wu, S.M., 2013. Building a large annotated corpus of learner English: The NUS corpus of...
  • Devlin, J., et al., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding.
  • Felice, M., Yuan, Z., Andersen, Ø.E., Yannakoudakis, H., Kochmar, E., 2014. Grammatical error correction using hybrid...
  • Flachs, S., Stahlberg, F., Kumar, S., 2021. Data Strategies for Low-Resource Grammatical Error Correction. In:...
  • Ge, T., Wei, F., Zhou, M., 2018. Fluency boost learning and inference for neural grammatical error correction. In:...
  • Grundkiewicz, R., Junczys-Dowmunt, M., Heafield, K., 2019. Neural grammatical error correction systems with...
  • Heilman, M., Cahill, A., Madnani, N., Lopez, M., Mulholland, M., Tetreault, J., 2014. Predicting grammaticality on an...
  • Hermann, K.M., et al., 2015. Teaching machines to read and comprehend.
  • Kaneko, M., et al., 2020. Encoder-decoder models can benefit from pre-trained masked language models in grammatical error correction.
  • Katsumata, S., et al., 2020. Stronger baselines for grammatical error correction using pretrained encoder-decoder model.
