
Text complexity of open educational resources in Portuguese: mixing written and spoken registers in a multi-task approach

  • Project Notes
Language Resources and Evaluation

Abstract

This paper presents a study on the text complexity of Open Educational Resources (OER) in Brazilian Portuguese. A data analysis of the Brazilian Ministry of Education Integrated Platform (MEC-RED), carried out in September 2020, showed that 86% of the resources on the platform had no grade-level classification, which makes them difficult to find, use, and expand. The text complexity task in Natural Language Processing can identify texts whose linguistic complexity is adequate for specific grade levels, making it possible to fill in the stage-of-education metadata in MEC-RED. However, some types of MEC-RED resources carry no information about their stage of education, which makes it unfeasible to compile a balanced dataset of OER for training a text complexity predictor. This study is driven and enabled by a recently created corpus of transcribed spoken narratives, produced by students from the fourth grade to the first year of high school and collected to evaluate the development of language abilities. A multi-task learning (MTL) approach with hard parameter sharing was adopted to train three models that share all parameters in their hidden layers. The main objective was to explore the relationship between three text complexity tasks by jointly learning to predict text readability, using coarse- and fine-grained datasets of written, spoken, and domain texts (a small dataset of OER) to overcome the lack of grade-classified resources in MEC-RED. Our MTL model with two auxiliary tasks achieves an F-measure of 0.955, an improvement of 0.15 points over our previous results.
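To make the idea of hard parameter sharing concrete, the following is a minimal sketch of the forward pass only, written in plain Python. It is not the architecture used in the study: the task names, the number of classes per task, the layer sizes, and the random initialisation are all assumptions chosen for illustration, and training is omitted. The sketch shows one hidden layer whose parameters are shared by every task, followed by a separate output layer per task.

```python
import math
import random


def dense(inputs, weights, biases):
    """Affine layer: one output value per row of `weights`."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_out, n_in, rng):
    """Small random weights and zero biases for an n_in -> n_out layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


class HardSharingModel:
    """One shared hidden layer plus one task-specific output head per task."""

    def __init__(self, n_features, n_hidden, classes_per_task, seed=0):
        rng = random.Random(seed)
        # Shared parameters: every task reads the same hidden representation.
        self.shared = init_layer(n_hidden, n_features, rng)
        # Task-specific parameters: one softmax head per task.
        self.heads = {task: init_layer(n_classes, n_hidden, rng)
                      for task, n_classes in classes_per_task.items()}

    def hidden(self, features):
        return [math.tanh(v) for v in dense(features, *self.shared)]

    def predict_proba(self, task, features):
        return softmax(dense(self.hidden(features), *self.heads[task]))


# Hypothetical tasks and class counts, not taken from the paper.
model = HardSharingModel(
    n_features=4,
    n_hidden=3,
    classes_per_task={"written": 2, "spoken": 5, "oer": 2},
)
sample = [0.2, -1.0, 0.5, 0.1]  # stand-in for a vector of linguistic features
for task in model.heads:
    print(task, model.predict_proba(task, sample))
```

In a trained model of this kind, the losses of all tasks would update the shared layer, while each head would be updated only by its own task; that is what lets auxiliary tasks influence the representation used by the main task.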



Data availability

The datasets are available in the project's GitHub repository: github.com/gazzola/MTC-DTG.

Code availability

The trained models are available in the project's GitHub repository: github.com/gazzola/MTC-DTG.

Notes

  1. https://cohmetrix.com/.

  2. https://liwc.wpengine.com/.

  3. https://plataformaintegrada.mec.gov.br/.

  4. https://weka.sourceforge.io/doc.dev/weka/filters/supervised/instance/ClassBalancer.html.

  5. https://textcomplexity.questarai.com/getdrp/.

  6. SimpleLogistic model of the Weka tool.

  7. https://data.allenai.org/ai2-science-questions-mercury/.

  8. https://corestandards.org.

  9. https://zh.clicrbs.com.br/rs.

  10. Several texts were grouped to obtain a reasonable size.

  11. https://portal.inep.gov.br/educacao-basica/saeb.

  12. https://adole-sendo.info/.

  13. NILC-Metrix version 2021 is the current version of the tool, has 200 metrics, and is available at http://fw.nilc.icmc.usp.br:23380/nilcmetrix.

  14. http://corpusbrasileiro.pucsp.br/.

  15. Scores are based on a 7-item Likert scale, and the metrics assess four score ranges on the scale.

  16. https://scikit-learn.org/stable/auto_examples/feature_selection/plot_rfe_with_cross_validation.html.

  17. https://github.com/gazzola/MTC-DTG.

  18. http://lxcenter.di.fc.ul.pt/tools/pt/conteudo/LXParser.html.

  19. http://www.maltparser.org/.

  20. https://sites.google.com/icmc.usp.br/poetisa.


Acknowledgements

The authors thank the following agencies for their financial support of the Adole-sendo project: FAPESP (grant to SP #2016/14750-0 and grant to BP #2020/01091-3), CAPES (finance code 001; postdoctoral grant to FTR #88887.357997/2019-00; and PhD grant to MG, PROEX-8436630/D), CNPq (Research Productivity Grant to SP #301899/2019-3), and AFIP.

Funding

This research was supported by three Brazilian funding agencies: the São Paulo Research Foundation (FAPESP), the National Council for Scientific and Technological Development (CNPq), and the Coordination for the Improvement of Higher Education Personnel (CAPES).

Author information

Contributions

SA, MG and SL conceived the presented idea. SP and MG gathered the data for analysis. BP, FTR, MG and SA manually annotated the transcribed narratives. SL carried out the automatic analysis of the language samples. MG and SL implemented the multi-task learning methods. All authors contributed to the final manuscript.

Corresponding author

Correspondence to Sandra Aluísio.

Ethics declarations

Conflict of interest

The authors have no conflicts of interest to declare.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Cite this article

Gazzola, M., Leal, S., Pedroni, B. et al. Text complexity of open educational resources in Portuguese: mixing written and spoken registers in a multi-task approach. Lang Resources & Evaluation 56, 621–650 (2022). https://doi.org/10.1007/s10579-021-09571-3
