Abstract
REALEC, an open-access learner corpus, had by 2020 accumulated 6,054 essays written in English by HSE undergraduate students in their university-level English examination. This paper reports on the data collection and manual annotation approaches applied to the texts of 2014–2019 and discusses the computer tools available for working with the corpus. This review provides the basis for the ongoing development of automated annotation for new portions of learner texts in the corpus. The first part presents observations on the reliability of the 134,608 error tags manually annotated across the texts in the corpus. Examples are given to emphasize the role of interference from the learners’ L1 (Russian), which is one more direction for future corpus research. A number of studies carried out by the research team on the basis of REALEC data are listed as examples of the research potential that the corpus offers.
The research was carried out within the project of the HSE University Research Foundation 2021 - Automated analysis of text written in English by learners with Russian L1 (ADWISER).
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Vinogradova, O., Lyashevskaya, O. (2022). Review of Practices of Collecting and Annotating Texts in the Learner Corpus REALEC. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech, and Dialogue. TSD 2022. Lecture Notes in Computer Science, vol 13502. Springer, Cham. https://doi.org/10.1007/978-3-031-16270-1_7
DOI: https://doi.org/10.1007/978-3-031-16270-1_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-16269-5
Online ISBN: 978-3-031-16270-1
eBook Packages: Computer Science, Computer Science (R0)