Neural Text Categorization with Transformers for Learning Portuguese as a Second Language

Santos, Rodrigo; Rodrigues, João; Branco, António; Vaz, Rui

doi:10.1007/978-3-030-86230-5_56

Conference paper
First Online: 03 September 2021

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12981)

Abstract

We report on the application of a neural-network-based approach to the problem of automatically categorizing texts according to their proficiency levels and suitability for learners of Portuguese as a second language. We resort to a particular deep learning architecture, the Transformer, and fine-tune GPT-2 and RoBERTa on data sets labeled with the standard CEFR proficiency levels, which were provided by Camões IC, the official Portuguese language institute. Despite the reduced size of the available data sets, we found that the resulting models outperform previous, carefully crafted feature-based counterparts in most evaluation scenarios, thus setting a new state of the art for this task for the Portuguese language.
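The classification framing described in the abstract can be illustrated with a minimal sketch: a fine-tuned model emits one score per proficiency level, and the predicted level is the one with the highest probability. This is not the authors' implementation. The label set (the six standard CEFR levels, of which the paper's data sets may use only a subset), the example logits, and the helper names `softmax`, `predict_level`, and `accuracy` are all assumptions made for illustration.

```python
import math

# Assumption: the six standard CEFR levels; the paper's data sets may use a subset.
CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]


def softmax(logits):
    """Turn raw classifier scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_level(logits):
    """Return the CEFR level with the highest probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CEFR_LEVELS[best]


def accuracy(gold, predicted):
    """Fraction of texts whose predicted level matches the gold label."""
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Hypothetical logits for three texts, standing in for the output of a
# fine-tuned classification head; these are not results from the paper.
example_logits = [
    [2.1, 0.3, -0.5, -1.0, -1.2, -2.0],
    [-0.4, 0.2, 1.9, 0.8, -0.3, -1.1],
    [-1.5, -0.9, 0.1, 0.4, 1.2, 0.7],
]
gold_levels = ["A1", "B1", "B2"]
predicted_levels = [predict_level(row) for row in example_logits]
```

In practice the logits would come from a pretrained Transformer with a classification head fine-tuned on the labeled texts; the decoding and scoring steps stay the same.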

Notes

1. https://www.instituto-camoes.pt/.
2. https://portulanclarin.net/workbench/lx-proficiency.
3. https://huggingface.co/pierreguillou/gpt2-small-portuguese.
4. https://oscar-corpus.com/.
5. https://string.hlt.inesc-id.pt/demo/classification.pl.
6. https://logitboost.readthedocs.io/.
7. https://portulanclarin.net/workbench/lx-proficiency.
8. The PORTULAN CLARIN workbench comprises a number of tools that are based on a large body of research work contributed by different authors and teams, which continues to grow and is acknowledged here: [3, 4, 5, 6, 7, 10, 11, 12, 14, 27, 32, 33, 35, 40, 45].

References

1. Aluisio, S., Specia, L., Gasperin, C., Scarton, C.: Readability assessment for text simplification. In: Proceedings of the NAACL HLT 2010 Fifth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 1–9 (2010)
2. Bahdanau, D., Cho, K.H., Bengio, Y.: Neural machine translation by jointly learning to align and translate. In: 3rd International Conference on Learning Representations, ICLR 2015 (2015)
3. Barreto, F., et al.: Open resources and tools for the shallow processing of Portuguese: the TagShare project. In: Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC), pp. 1438–1443 (2006)
4. Branco, A., Henriques, T.: Aspects of verbal inflection and lemmatization: generalizations and algorithms. In: Proceedings of XVIII Annual Meeting of the Portuguese Association of Linguistics (APL), pp. 201–210 (2003)
5. Branco, A., Castro, S., Silva, J., Costa, F.: CINTIL DepBank handbook: Design options for the representation of grammatical dependencies. Technical report, University of Lisbon (2011)
6. Branco, A., et al.: Developing a deep linguistic databank supporting a collection of treebanks: the CINTIL DeepGramBank. In: Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC), pp. 1810–1815 (2010)
7. Branco, A., Nunes, F.: Verb analysis in a highly inflective language with an MFF algorithm. In: Caseli, H., Villavicencio, A., Teixeira, A., Perdigão, F. (eds.) PROPOR 2012. LNCS (LNAI), vol. 7243, pp. 1–11. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-28885-2_1
8. Branco, A., Rodrigues, J., Costa, F., Silva, J., Vaz, R.: Assessing automatic text classification for interactive language learning. In: International Conference on Information Society (i-Society 2014), pp. 70–78 (2014)
9. Branco, A., Rodrigues, J., Costa, F., Silva, J., Vaz, R.: Rolling out text categorization for language learning assessment supported by language technology. In: Baptista, J., Mamede, N., Candeias, S., Paraboni, I., Pardo, T.A.S., Volpe Nunes, M.G. (eds.) PROPOR 2014. LNCS (LNAI), vol. 8775, pp. 256–261. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-09761-9_29
10. Branco, A., Rodrigues, J., Silva, J., Costa, F., Vaz, R.: Assessing automatic text classification for interactive language learning. In: Proceedings of the IEEE International Conference on Information Society (iSociety), pp. 72–80 (2014)
11. Branco, A., Silva, J.: A suite of shallow processing tools for Portuguese: LX-suite. In: Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pp. 179–182 (2006)
12. Costa, F., Branco, A.: Aspectual type and temporal relation classification. In: Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pp. 266–275 (2012)
13. Crossley, S.A., Skalicky, S., Dascalu, M., McNamara, D.S., Kyle, K.: Predicting text comprehension, processing, and familiarity in adult readers: new approaches to readability formulas. Discourse Process. 54, 340–359 (2017)
14. Cruz, A.F., Rocha, G., Cardoso, H.L.: Exploring Spanish corpora for Portuguese coreference resolution. In: 2018 Fifth International Conference on Social Networks Analysis, Management and Security (SNAMS), pp. 290–295 (2018)
15. Curto, P.: Classificador de textos para o ensino de português como segunda língua. Master’s thesis, Instituto Superior Técnico-Universidade de Lisboa, Lisboa (2014)
16. Curto, P., Mamede, N., Baptista, J.: Automatic text difficulty classifier. In: Proceedings of the 7th International Conference on Computer Supported Education, vol. 1, pp. 36–44 (2015)
17. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186 (2019)
18. DuBay, W.H.: The Principles of Readability. Impact Information, Costa Mesa (2004)
19. Council of Europe, Council for Cultural Co-operation, Education Committee, Modern Languages Division: Common European Framework of Reference for Languages: learning, teaching, assessment (2001)
20. Flesch, R.: How to Write Plain English: A Book for Lawyers and Consumers. Harpercollins, New York (1979)
21. Forti, L., Grego, G., Santarelli, F., Santucci, V., Spina, S.: MALT-IT2: a new resource to measure text difficulty in light of CEFR levels for Italian L2 learning. In: 12th Language Resources and Evaluation Conference, pp. 7206–7213 (2020)
22. François, T., Fairon, C.: An “AI readability” formula for French as a foreign language. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 466–477 (2012)
23. Hancke, J., Meurers, D.: Exploring CEFR classification for German based on rich linguistic modeling. In: Learner Corpus Research, pp. 54–56 (2013)
24. Jönsson, S., Rennes, E., Falkenjack, J., Jönsson, A.: A component based approach to measuring text complexity. In: The Seventh Swedish Language Technology Conference (SLTC-18), Stockholm, Sweden, 7–9 November 2018 (2018)
25. Liu, Y., et al.: RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019)
26. Martinc, M., Pollak, S., Robnik-Šikonja, M.: Supervised and unsupervised neural approaches to text readability (to be published)
27. Miranda, N., Raminhos, R., Seabra, P., Sequeira, J., Gonçalves, T., Quaresma, P.: Named entity recognition using machine learning techniques. In: EPIA-11, 15th Portuguese Conference on Artificial Intelligence, pp. 818–831 (2011)
28. Pilán, I., Volodina, E.: Investigating the importance of linguistic complexity features across different datasets related to language learning. In: Proceedings of the Workshop on Linguistic Complexity and Natural Language Processing, pp. 49–58 (2018)
29. Radford, A., et al.: Better language models and their implications. OpenAI Blog (2019). https://openai.com/blog/better-language-models
30. Reynolds, R.: Insights from Russian second language readability classification: complexity-dependent training requirements, and feature evaluation of multiple categories. In: Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications, pp. 289–300 (2016)
31. del Río, I.: Automatic proficiency classification in L2 Portuguese. Procesamiento del Lenguaje Nat. 63, 67–74 (2019)
32. Rodrigues, J., Costa, F., Silva, J., Branco, A.: Automatic syllabification of Portuguese. Revista da Associação Portuguesa de Linguística (1), 715–720 (2020)
33. Rodrigues, J., Branco, A., Neale, S., Silva, J.: LX-DSemVectors: distributional semantics models for Portuguese. In: Silva, J., Ribeiro, R., Quaresma, P., Adami, A., Branco, A. (eds.) PROPOR 2016. LNCS (LNAI), vol. 9727, pp. 259–270. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-41552-9_27
34. Santini, M., Jönsson, A., Rennes, E.: Visualizing facets of text complexity across registers. In: Proceedings of the 1st Workshop on Tools and Resources to Empower People with REAding DIfficulties (READI), pp. 49–56 (2020)
35. Santos, R., Silva, J., Branco, A., Xiong, D.: The direct path may not be the best: Portuguese-Chinese neural machine translation. In: Proceedings of the 19th EPIA Conference on Artificial Intelligence, pp. 757–768 (2019)
36. Santos, R., et al.: Measuring the impact of readability features in fake news detection. In: Proceedings of The 12th Language Resources and Evaluation Conference, pp. 1404–1413 (2020)
37. Santucci, V., Santarelli, F., Forti, L., Spina, S.: Automatic classification of text complexity. Appl. Sci. 10, 7285 (2020)
38. Schuster, M., Paliwal, K.K.: Bidirectional recurrent neural networks. IEEE Trans. Signal Process. 45, 2673–2681 (1997)
39. Sennrich, R., Haddow, B., Birch, A.: Neural machine translation of rare words with subword units. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1715–1725 (2016)
40. Silva, J., Branco, A., Castro, S., Reis, R.: Out-of-the-box robust parsing of Portuguese. In: Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC), pp. 75–85 (2009)
41. Sung, Y.T., Lin, W.C., Dyson, S.B., Chang, K.E., Chen, Y.C.: Leveling L2 texts through readability: combining multilevel linguistic features with the CEFR. Mod. Lang. J. 99, 371–391 (2015)
42. Tack, A., François, T., Desmet, P., Fairon, C.: NT2Lex: a CEFR-graded lexical resource for Dutch as a foreign language linked to open Dutch wordnet. In: Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 137–146 (2018)
43. Vajjala, S., Loo, K.: Automatic CEFR level prediction for Estonian learner text. In: Proceedings of the Third Workshop on NLP for Computer-Assisted Language Learning, pp. 113–127 (2014)
44. Vaswani, A., et al.: Attention is all you need. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 6000–6010 (2017)
45. Veiga, A., Candeias, S., Perdigão, F.: Generating a pronunciation dictionary for European Portuguese using a joint-sequence model with embedded stress assignment. In: Proceedings of the 8th Brazilian Symposium in Information and Human Language Technology (2011)
46. Wolf, T., et al.: Transformers: state-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45 (2020)
47. Yang, Z., Yang, D., Dyer, C., He, X., Smola, A., Hovy, E.: Hierarchical attention networks for document classification. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1480–1489 (2016)

Acknowledgements

The work leading to the research results reported in this paper was mostly supported by Camões I.P. Instituto da Cooperação e da Língua. It was also partially supported by the PORTULAN CLARIN Research Infrastructure for the Science and Technology of Language, funded by Lisboa 2020, Alentejo 2020 and FCT (Fundação para a Ciência e Tecnologia) under the grant PINFRA/22117/2016.

Author information

Authors and Affiliations

NLX—Natural Language and Speech Group, Department of Informatics, Faculdade de Ciências, University of Lisbon, 1749-016, Campo Grande, Lisbon, Portugal
Rodrigo Santos, João Rodrigues & António Branco
Camões I.P. Instituto da Cooperação e da Língua, Av. da Liberdade 270, 1250-149, Lisbon, Portugal
Rui Vaz

Corresponding author

Correspondence to Rodrigo Santos.

Editor information

Editors and Affiliations

ISEP/GECAD, Polytechnic Institute of Porto, Porto, Portugal
Goreti Marreiros
IST/INESC-ID, University of Lisbon, Porto Salvo, Portugal
Francisco S. Melo
DETI/IEETA, University of Aveiro, Aveiro, Portugal
Nuno Lau
FEUP/LIACC, University of Porto, Porto, Portugal
Henrique Lopes Cardoso
FEUP/LIACC, University of Porto, Porto, Portugal
Luís Paulo Reis

About this paper

Cite this paper

Santos, R., Rodrigues, J., Branco, A., Vaz, R. (2021). Neural Text Categorization with Transformers for Learning Portuguese as a Second Language. In: Marreiros, G., Melo, F.S., Lau, N., Lopes Cardoso, H., Reis, L.P. (eds) Progress in Artificial Intelligence. EPIA 2021. Lecture Notes in Computer Science (LNAI), vol 12981. Springer, Cham. https://doi.org/10.1007/978-3-030-86230-5_56

DOI: https://doi.org/10.1007/978-3-030-86230-5_56
Published: 03 September 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-86229-9
Online ISBN: 978-3-030-86230-5
eBook Packages: Computer Science, Computer Science (R0)
