Non-native text analysis: A survey

Published online by Cambridge University Press:  07 September 2015

SEAN MASSUNG
Affiliation:
Department of Computer Science, College of Engineering, University of Illinois at Urbana–Champaign, Urbana, Illinois, USA e-mail: massung1@illinois.edu, czhai@illinois.edu
CHENGXIANG ZHAI
Affiliation:
Department of Computer Science, College of Engineering, University of Illinois at Urbana–Champaign, Urbana, Illinois, USA e-mail: massung1@illinois.edu, czhai@illinois.edu

Abstract

Non-native speakers of English far outnumber native speakers; English is the main language of books, newspapers, airports, air-traffic control, international business, academic conferences, science, technology, diplomacy, sports, international competitions, pop music, and advertising (British Council 2014). Online education in the form of massive open online courses is also primarily in English, even when teaching English. This creates enormous amounts of text written by non-native speakers, which in turn generates a need for grammar correction and analysis. Even aside from massive open online courses, the number of English learners in Asia alone is in the tens of millions. In this paper, we provide a survey of the two main areas of existing work on non-native text analysis, prefaced by an overview of common datasets used by researchers, comparing their attributes and potential uses. Then, an introduction to native language identification follows: determining the native language of an author based on text in the second language. This section is subdivided into various techniques and a shared task on this classification problem. Next, we discuss non-native grammatical error correction: finding and modifying text to fix errors or to make it sound more fluent. Again, we discuss different methods before investigating a relevant shared task. Lastly, we end with conclusions and potential future directions. While this survey primarily focuses on detecting and correcting non-native English text, many approaches are general and can be used across any language pairing.

Type
Survey Paper
Copyright
Copyright © Cambridge University Press 2015 
