Text complexity of open educational resources in Portuguese: mixing written and spoken registers in a multi-task approach

Gazzola, Murilo; Leal, Sidney; Pedroni, Breno; Theoto Rocha, Fábio; Pompéia, Sabine; Aluísio, Sandra

doi:10.1007/s10579-021-09571-3

Project Notes
Published: 10 January 2022

Volume 56, pages 621–650 (2022)

Language Resources and Evaluation

Abstract

This paper presents a study on the text complexity of Open Educational Resources (OER) in Brazilian Portuguese. A data analysis of the Brazilian Ministry of Education Integrated Platform (MEC-RED), carried out in September 2020, showed that 86% of the resources on the platform had no grade-level classification, which makes them difficult to find, use, and expand. The text complexity task in Natural Language Processing can be used to identify texts whose linguistic complexity is adequate for specific grade levels, making it possible to fill in the stage-of-education metadata in MEC-RED. However, some types of MEC-RED resources carry no information about their stage of education, so compiling a balanced dataset of OER for training a text complexity predictor is unfeasible. This study is driven and enabled by a recently created corpus of transcribed spoken narratives, produced by students from the fourth grade through the first year of high school and collected to evaluate the development of language abilities. A multi-task learning (MTL) approach via hard parameter sharing was adopted to train three models that share all parameters in their hidden layers. The main objective of this study was to explore the relationship between three text complexity tasks by jointly learning to predict text readability, using coarse- and fine-grained datasets of written, spoken, and domain texts (a small dataset of OER) to overcome the lack of grade-classified resources in MEC-RED. Our MTL model with two auxiliary tasks achieves an F-measure of 0.955, an improvement of 0.15 points over our previous results.
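As a rough illustration of hard parameter sharing, the sketch below runs a forward pass through one hidden layer shared by three task-specific output heads. It is not the authors' implementation: the task names, feature count, layer sizes, class counts, and random weights are all hypothetical, and training is omitted. In a trained model, the losses of all three tasks would update the shared layer, while each head would be updated only by its own task.

```python
import math
import random


def dense(inputs, weights, biases):
    """Affine layer: one output value per row of `weights`."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(values):
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(rng, n_in, n_out):
    """Random weights and zero biases (illustrative, untrained)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


class HardSharingModel:
    """One shared hidden layer plus one output head per task."""

    def __init__(self, n_features, n_hidden, classes_per_task, seed=0):
        rng = random.Random(seed)
        # Shared parameters: used by every task.
        self.shared = init_layer(rng, n_features, n_hidden)
        # Task-specific parameters: one classification head per task.
        self.heads = {task: init_layer(rng, n_hidden, n_classes)
                      for task, n_classes in classes_per_task.items()}

    def hidden(self, features):
        return relu(dense(features, *self.shared))

    def predict_proba(self, features, task):
        return softmax(dense(self.hidden(features), *self.heads[task]))


# Hypothetical tasks and class counts, chosen only for illustration.
model = HardSharingModel(
    n_features=4,
    n_hidden=3,
    classes_per_task={"written_coarse": 2, "spoken_fine": 7, "oer_domain": 3},
)
features = [0.2, 0.7, 0.1, 0.5]  # made-up linguistic feature values
for task in model.heads:
    probs = model.predict_proba(features, task)
    print(task, [round(p, 3) for p in probs])
```

Every task reads the same hidden representation, so evidence from the larger written and spoken datasets can shape the features that the small OER task relies on.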

Data availability

The datasets are available in the project's GitHub repository: github.com/gazzola/MTC-DTG.

Code availability

The trained models are available in the project's GitHub repository: github.com/gazzola/MTC-DTG.

Acknowledgements

The authors thank the following agencies for financial support of the Adole-sendo project: FAPESP (grant to SP # 2016/14750-0 and grant to BP # 2020/01091-3), CAPES (finance code 001; postdoctoral grant to FTR # 88887.357997/2019-00; and PhD grant to MG PROEX-8436630/D), CNPq (Research Productivity Grant to SP # 301899/2019-3), and AFIP.

Funding

This research was supported by three Brazilian funding agencies: the São Paulo Research Foundation (FAPESP), the National Council for Scientific and Technological Development (CNPq), and the Coordination for the Improvement of Higher Education Personnel (CAPES).

Author information

Authors and Affiliations

Institute of Mathematics and Computer Science, University of São Paulo, São Carlos, Brazil
Murilo Gazzola, Sidney Leal & Sandra Aluísio
Department of Psychobiology, Universidade Federal de São Paulo, São Paulo, Brazil
Breno Pedroni, Fábio Theoto Rocha & Sabine Pompéia

Contributions

SA, MG and SL conceived the presented idea. SP and MG gathered the data for analysis. BP, FTR, MG and SA manually annotated the transcribed narratives. SL carried out the automatic analysis of the language samples. MG and SL implemented the multi-task learning methods. All authors contributed to the final manuscript.

Corresponding author

Correspondence to Sandra Aluísio.

Ethics declarations

Conflict of interest

The authors have no conflicts of interest to declare.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Gazzola, M., Leal, S., Pedroni, B. et al. Text complexity of open educational resources in Portuguese: mixing written and spoken registers in a multi-task approach. Lang Resources & Evaluation 56, 621–650 (2022). https://doi.org/10.1007/s10579-021-09571-3

Accepted: 23 November 2021
Published: 10 January 2022
Issue Date: June 2022
DOI: https://doi.org/10.1007/s10579-021-09571-3
