Review of Practices of Collecting and Annotating Texts in the Learner Corpus REALEC

  • Conference paper
Text, Speech, and Dialogue (TSD 2022)

Abstract

REALEC, a learner corpus released in open access, had received 6,054 essays written in English by HSE undergraduate students in their university-level English examination by the year 2020. This paper reports on the data collection and manual annotation approaches for the texts of 2014–2019 and discusses the computer tools available for working with the corpus. This provides the basis for the ongoing development of automated annotation for the new portions of learner texts in the corpus. The observations in the first part concern the reliability of the total of 134,608 error tags manually annotated across the texts in the corpus. Some examples are given in the paper to emphasize the role of interference from the learners’ L1 (Russian), which is one more direction for future corpus research. A number of studies carried out by the research team on the basis of the REALEC data are listed as examples of the research potential that the corpus provides.

The research was carried out within the project of the HSE University Research Foundation 2021 - Automated analysis of text written in English by learners with Russian L1 (ADWISER).

Corresponding author

Correspondence to Olga Lyashevskaya.

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper


Cite this paper

Vinogradova, O., Lyashevskaya, O. (2022). Review of Practices of Collecting and Annotating Texts in the Learner Corpus REALEC. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech, and Dialogue. TSD 2022. Lecture Notes in Computer Science, vol 13502. Springer, Cham. https://doi.org/10.1007/978-3-031-16270-1_7

  • DOI: https://doi.org/10.1007/978-3-031-16270-1_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-16269-5

  • Online ISBN: 978-3-031-16270-1

  • eBook Packages: Computer Science, Computer Science (R0)
