Neural Text Categorization with Transformers for Learning Portuguese as a Second Language

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12981)

Abstract

We report on the application of a neural network-based approach to the problem of automatically categorizing texts according to their proficiency level and suitability for learners of Portuguese as a second language. We use a particular deep learning architecture, the Transformer: we fine-tune GPT-2 and RoBERTa on data sets labeled with the standard CEFR proficiency levels, which were provided by Camões IC, the official Portuguese language institute. Despite the small size of the available data sets, we found that the resulting models outperform previous, carefully crafted feature-based counterparts in most evaluation scenarios, thus setting a new state of the art for this task for the Portuguese language.
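A minimal sketch of how such a classification setup might be wired, not the authors' actual code. It assumes the six standard CEFR levels as classes (the abstract does not say which levels the data sets cover), the Hugging Face `transformers` library, and the Portuguese GPT-2 checkpoint listed in Note 3; the helper names `accuracy` and `load_classifier` are hypothetical.

```python
from typing import Dict, List

# The six standard CEFR levels, ordered from lowest to highest proficiency.
# Assumption: the abstract does not state which levels the data sets cover.
CEFR_LEVELS: List[str] = ["A1", "A2", "B1", "B2", "C1", "C2"]
LABEL2ID: Dict[str, int] = {level: i for i, level in enumerate(CEFR_LEVELS)}
ID2LABEL: Dict[int, str] = {i: level for level, i in LABEL2ID.items()}


def accuracy(gold: List[str], predicted: List[str]) -> float:
    """Fraction of texts whose predicted CEFR level matches the gold level."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def load_classifier(checkpoint: str = "pierreguillou/gpt2-small-portuguese"):
    """Load a pretrained Transformer with a fresh classification head.

    Needs the third-party `transformers` package (with PyTorch); it is
    imported lazily so the helpers above work without it. The default
    checkpoint is the one listed in Note 3, used here as an assumption.
    """
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint,
        num_labels=len(CEFR_LEVELS),
        id2label=ID2LABEL,
        label2id=LABEL2ID,
    )
    # GPT-2 style tokenizers ship without a padding token, so reuse the
    # end-of-text token to allow batching texts of different lengths.
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
        model.config.pad_token_id = tokenizer.pad_token_id
    return tokenizer, model
```

Fine-tuning would then train this model on the labeled texts and score held-out predictions with `accuracy`; the training loop and hyperparameters are left out because the abstract does not describe them.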

Notes

  1. https://www.instituto-camoes.pt/

  2. https://portulanclarin.net/workbench/lx-proficiency

  3. https://huggingface.co/pierreguillou/gpt2-small-portuguese

  4. https://oscar-corpus.com/

  5. https://string.hlt.inesc-id.pt/demo/classification.pl

  6. https://logitboost.readthedocs.io/

  7. https://portulanclarin.net/workbench/lx-proficiency

  8. The PORTULAN CLARIN workbench comprises a number of tools that are based on a large body of research work contributed by different authors and teams, which continues to grow and is acknowledged here: [3, 4, 5, 6, 7, 10, 11, 12, 14, 27, 32, 33, 35, 40, 45].

References

  1. Aluisio, S., Specia, L., Gasperin, C., Scarton, C.: Readability assessment for text simplification. In: Proceedings of the NAACL HLT 2010 Fifth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 1–9 (2010)

  2. Bahdanau, D., Cho, K.H., Bengio, Y.: Neural machine translation by jointly learning to align and translate. In: 3rd International Conference on Learning Representations, ICLR 2015 (2015)

  3. Barreto, F., et al.: Open resources and tools for the shallow processing of Portuguese: the TagShare project. In: Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC), pp. 1438–1443 (2006)

  4. Branco, A., Henriques, T.: Aspects of verbal inflection and lemmatization: generalizations and algorithms. In: Proceedings of XVIII Annual Meeting of the Portuguese Association of Linguistics (APL), pp. 201–210 (2003)

  5. Branco, A., Castro, S., Silva, J., Costa, F.: CINTIL DepBank handbook: Design options for the representation of grammatical dependencies. Technical report, University of Lisbon (2011)

  6. Branco, A., et al.: Developing a deep linguistic databank supporting a collection of treebanks: the CINTIL DeepGramBank. In: Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC), pp. 1810–1815 (2010)

  7. Branco, A., Nunes, F.: Verb analysis in a highly inflective language with an MFF algorithm. In: Caseli, H., Villavicencio, A., Teixeira, A., Perdigão, F. (eds.) PROPOR 2012. LNCS (LNAI), vol. 7243, pp. 1–11. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-28885-2_1

  8. Branco, A., Rodrigues, J., Costa, F., Silva, J., Vaz, R.: Assessing automatic text classification for interactive language learning. In: International Conference on Information Society (i-Society 2014), pp. 70–78 (2014)

  9. Branco, A., Rodrigues, J., Costa, F., Silva, J., Vaz, R.: Rolling out text categorization for language learning assessment supported by language technology. In: Baptista, J., Mamede, N., Candeias, S., Paraboni, I., Pardo, T.A.S., Volpe Nunes, M.G. (eds.) PROPOR 2014. LNCS (LNAI), vol. 8775, pp. 256–261. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-09761-9_29

  10. Branco, A., Rodrigues, J., Silva, J., Costa, F., Vaz, R.: Assessing automatic text classification for interactive language learning. In: Proceedings of the IEEE International Conference on Information Society (iSociety), pp. 72–80 (2014)

  11. Branco, A., Silva, J.: A suite of shallow processing tools for Portuguese: LX-suite. In: Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pp. 179–182 (2006)

  12. Costa, F., Branco, A.: Aspectual type and temporal relation classification. In: Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pp. 266–275 (2012)

  13. Crossley, S.A., Skalicky, S., Dascalu, M., McNamara, D.S., Kyle, K.: Predicting text comprehension, processing, and familiarity in adult readers: new approaches to readability formulas. Discourse Process. 54, 340–359 (2017)

  14. Cruz, A.F., Rocha, G., Cardoso, H.L.: Exploring Spanish corpora for Portuguese coreference resolution. In: 2018 Fifth International Conference on Social Networks Analysis, Management and Security (SNAMS), pp. 290–295 (2018)

  15. Curto, P.: Classificador de textos para o ensino de português como segunda língua. Master’s thesis, Instituto Superior Técnico-Universidade de Lisboa, Lisboa (2014)

  16. Curto, P., Mamede, N., Baptista, J.: Automatic text difficulty classifier. In: Proceedings of the 7th International Conference on Computer Supported Education, vol. 1, pp. 36–44 (2015)

  17. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186 (2019)

  18. DuBay, W.H.: The Principles of Readability. Impact Information, Costa Mesa (2004)

  19. Council of Europe, Council for Cultural Co-operation, Education Committee, Modern Languages Division: Common European Framework of Reference for Languages: learning, teaching, assessment (2001)

  20. Flesch, R.: How to Write Plain English: A Book for Lawyers and Consumers. Harpercollins, New York (1979)

  21. Forti, L., Grego, G., Santarelli, F., Santucci, V., Spina, S.: MALT-IT2: a new resource to measure text difficulty in light of CEFR levels for Italian L2 learning. In: 12th Language Resources and Evaluation Conference, pp. 7206–7213 (2020)

  22. François, T., Fairon, C.: An “AI readability” formula for French as a foreign language. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 466–477 (2012)

  23. Hancke, J., Meurers, D.: Exploring CEFR classification for German based on rich linguistic modeling. In: Learner Corpus Research, pp. 54–56 (2013)

  24. Jönsson, S., Rennes, E., Falkenjack, J., Jönsson, A.: A component based approach to measuring text complexity. In: The Seventh Swedish Language Technology Conference (SLTC-18), Stockholm, Sweden, 7–9 November 2018 (2018)

  25. Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019)

  26. Martinc, M., Pollak, S., Robnik-Šikonja, M.: Supervised and unsupervised neural approaches to text readability (to be published)

  27. Miranda, N., Raminhos, R., Seabra, P., Sequeira, J., Gonçalves, T., Quaresma, P.: Named entity recognition using machine learning techniques. In: EPIA-11, 15th Portuguese Conference on Artificial Intelligence, pp. 818–831 (2011)

  28. Pilán, I., Volodina, E.: Investigating the importance of linguistic complexity features across different datasets related to language learning. In: Proceedings of the Workshop on Linguistic Complexity and Natural Language Processing, pp. 49–58 (2018)

  29. Radford, A., et al.: Better language models and their implications. OpenAI Blog (2019). https://openai.com/blog/better-language-models

  30. Reynolds, R.: Insights from Russian second language readability classification: complexity-dependent training requirements, and feature evaluation of multiple categories. In: Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications, pp. 289–300 (2016)

  31. del Río, I.: Automatic proficiency classification in L2 Portuguese. Procesamiento del Lenguaje Nat. 63, 67–74 (2019)

  32. Rodrigues, J., Costa, F., Silva, J., Branco, A.: Automatic syllabification of Portuguese. Revista da Associação Portuguesa de Linguística (1), 715–720 (2020)

  33. Rodrigues, J., Branco, A., Neale, S., Silva, J.: LX-DSemVectors: distributional semantics models for Portuguese. In: Silva, J., Ribeiro, R., Quaresma, P., Adami, A., Branco, A. (eds.) PROPOR 2016. LNCS (LNAI), vol. 9727, pp. 259–270. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-41552-9_27

  34. Santini, M., Jönsson, A., Rennes, E.: Visualizing facets of text complexity across registers. In: Proceedings of the 1st Workshop on Tools and Resources to Empower People with REAding DIfficulties (READI), pp. 49–56 (2020)

  35. Santos, R., Silva, J., Branco, A., Xiong, D.: The direct path may not be the best: Portuguese-Chinese neural machine translation. In: Proceedings of the 19th EPIA Conference on Artificial Intelligence, pp. 757–768 (2019)

  36. Santos, R., et al.: Measuring the impact of readability features in fake news detection. In: Proceedings of The 12th Language Resources and Evaluation Conference, pp. 1404–1413 (2020)

  37. Santucci, V., Santarelli, F., Forti, L., Spina, S.: Automatic classification of text complexity. Appl. Sci. 10, 7285 (2020)

  38. Schuster, M., Paliwal, K.K.: Bidirectional recurrent neural networks. IEEE Trans. Signal Process. 45, 2673–2681 (1997)

  39. Sennrich, R., Haddow, B., Birch, A.: Neural machine translation of rare words with subword units. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1715–1725 (2016)

  40. Silva, J., Branco, A., Castro, S., Reis, R.: Out-of-the-box robust parsing of Portuguese. In: Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC), pp. 75–85 (2009)

  41. Sung, Y.T., Lin, W.C., Dyson, S.B., Chang, K.E., Chen, Y.C.: Leveling L2 texts through readability: combining multilevel linguistic features with the CEFR. Mod. Lang. J. 99, 371–391 (2015)

  42. Tack, A., François, T., Desmet, P., Fairon, C.: NT2Lex: a CEFR-graded lexical resource for Dutch as a foreign language linked to open Dutch wordnet. In: Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 137–146 (2018)

  43. Vajjala, S., Loo, K.: Automatic CEFR level prediction for Estonian learner text. In: Proceedings of the Third Workshop on NLP for Computer-Assisted Language Learning, pp. 113–127 (2014)

  44. Vaswani, A., et al.: Attention is all you need. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 6000–6010 (2017)

  45. Veiga, A., Candeias, S., Perdigão, F.: Generating a pronunciation dictionary for European Portuguese using a joint-sequence model with embedded stress assignment. In: Proceedings of the 8th Brazilian Symposium in Information and Human Language Technology (2011)

  46. Wolf, T., et al.: Transformers: state-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45 (2020)

  47. Yang, Z., Yang, D., Dyer, C., He, X., Smola, A., Hovy, E.: Hierarchical attention networks for document classification. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1480–1489 (2016)

Acknowledgements

The work leading to the research results reported in this paper was mostly supported by Camões I.P. Instituto da Cooperação e da Língua. It was also partially supported by the PORTULAN CLARIN Research Infrastructure for the Science and Technology of Language, funded by Lisboa 2020, Alentejo 2020 and FCT (Fundação para a Ciência e Tecnologia) under the grant PINFRA/22117/2016.

Author information

Corresponding author

Correspondence to Rodrigo Santos.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Santos, R., Rodrigues, J., Branco, A., Vaz, R. (2021). Neural Text Categorization with Transformers for Learning Portuguese as a Second Language. In: Marreiros, G., Melo, F.S., Lau, N., Lopes Cardoso, H., Reis, L.P. (eds) Progress in Artificial Intelligence. EPIA 2021. Lecture Notes in Computer Science, vol 12981. Springer, Cham. https://doi.org/10.1007/978-3-030-86230-5_56

  • DOI: https://doi.org/10.1007/978-3-030-86230-5_56

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-86229-9

  • Online ISBN: 978-3-030-86230-5

  • eBook Packages: Computer Science (R0)
