Abstract
We report on the application of a neural network based approach to the problem of automatically categorizing texts according to their proficiency levels and suitability for learners of Portuguese as a second language. We use a particular deep learning architecture, the Transformer: we fine-tune GPT-2 and RoBERTa on data sets labeled with the standard CEFR proficiency levels, which were provided by Camões IC, the official Portuguese language institute. Despite the small size of the available data sets, the resulting models outperform previous, carefully crafted feature-based counterparts in most evaluation scenarios, thus setting a new state of the art for this task for the Portuguese language.
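As a rough illustration of how the task can be framed, the sketch below treats CEFR level prediction as multi-class classification: each level is mapped to an integer class id, and predictions are scored by accuracy. It is not the authors' code. The six-level label set is the standard CEFR scale and is an assumption here, since the data sets may cover only a subset of levels; the helper names `encode_labels` and `accuracy` are hypothetical.

```python
# Minimal sketch: CEFR level prediction framed as multi-class classification.
# Assumption: the full six-level CEFR scale; the actual data sets may use fewer levels.
CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]

# Mappings between level names and integer class ids, as a classifier head expects.
LABEL2ID = {level: index for index, level in enumerate(CEFR_LEVELS)}
ID2LABEL = {index: level for level, index in LABEL2ID.items()}


def encode_labels(levels):
    """Convert CEFR level names to integer class ids (hypothetical helper)."""
    return [LABEL2ID[level] for level in levels]


def accuracy(gold, predicted):
    """Fraction of texts whose predicted class id matches the gold class id."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty list")
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


if __name__ == "__main__":
    gold = encode_labels(["A1", "B1", "B2", "C1"])
    predicted = encode_labels(["A1", "B1", "B1", "C1"])  # made-up predictions
    print([ID2LABEL[i] for i in predicted], accuracy(gold, predicted))
```

Mappings of this kind are what a sequence classification model would typically be configured with before fine-tuning. The number of classes and the evaluation metric actually used in the paper are not stated in the abstract.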
Acknowledgements
The work leading to the research results reported in this paper was mostly supported by Camões I.P. Instituto da Cooperação e da Língua. It was also partially supported by PORTULAN CLARIN Research Infrastructure for the Science and Technology of Language, funded by Lisboa 2020, Alentejo 2020 and FCT (Fundação para a Ciência e Tecnologia) under the grant PINFRA/22117/2016.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Santos, R., Rodrigues, J., Branco, A., Vaz, R. (2021). Neural Text Categorization with Transformers for Learning Portuguese as a Second Language. In: Marreiros, G., Melo, F.S., Lau, N., Lopes Cardoso, H., Reis, L.P. (eds) Progress in Artificial Intelligence. EPIA 2021. Lecture Notes in Computer Science, vol 12981. Springer, Cham. https://doi.org/10.1007/978-3-030-86230-5_56
DOI: https://doi.org/10.1007/978-3-030-86230-5_56
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-86229-9
Online ISBN: 978-3-030-86230-5
eBook Packages: Computer Science, Computer Science (R0)