Abstract
This paper presents a study on the text complexity of Open Educational Resources (OER) in Brazilian Portuguese. A data analysis of the Brazilian Ministry of Education Integrated Platform (MEC-RED), carried out in September 2020, showed that 86% of the resources on the platform had no grade-level classification, which makes them difficult to find, use, and expand. The text complexity task in Natural Language Processing can be used to identify texts whose linguistic complexity is adequate for specific grade levels, making it possible to fill in the stage-of-education metadata in MEC-RED. However, some types of MEC-RED resources carry no information about their stage of education, which makes it unfeasible to compile a balanced dataset of OER for training a text complexity predictor. This study is driven and enabled by a recently created corpus of transcribed spoken narratives, produced by students from the fourth grade to the first year of high school and collected to evaluate the development of language abilities. A multi-task learning (MTL) approach via hard parameter sharing was adopted to train three models that share all parameters in their hidden layers. The main objective was to explore the relationship between three text complexity tasks by jointly learning to predict text readability, using coarse- and fine-grained datasets of written, spoken, and domain texts (a small dataset of OER) to overcome the lack of grade-classified resources in MEC-RED. Our MTL model with two auxiliary tasks achieves an F-measure of 0.955, an improvement of 0.15 points over our previous results.
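To make the idea of hard parameter sharing concrete, the following is a minimal sketch in plain Python, not the architecture used in this study. It assumes a single shared hidden layer and one softmax output head per task; the class name HardSharingMTL, the task names ("written", "spoken", "oer"), the layer sizes, and the number of classes per task are all hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


class HardSharingMTL:
    """Illustrative model: one hidden layer shared by all tasks, one head per task."""

    def __init__(self, n_features, n_hidden, classes_per_task, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.shared = init(n_hidden, n_features)  # parameters shared by every task
        # Task-specific output layers (class counts are assumptions, not the paper's).
        self.heads = {task: init(k, n_hidden) for task, k in classes_per_task.items()}

    def hidden(self, x):
        return [math.tanh(a) for a in matvec(self.shared, x)]

    def predict_proba(self, task, x):
        return softmax(matvec(self.heads[task], self.hidden(x)))

    def sgd_step(self, task, x, label, lr=0.1):
        """One cross-entropy SGD step; returns the loss measured before the update."""
        h = self.hidden(x)
        head = self.heads[task]
        probs = softmax(matvec(head, h))
        d_logits = [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]
        # Backpropagate to the hidden activations before the head is modified.
        d_hidden = [sum(d * row[j] for d, row in zip(d_logits, head))
                    for j in range(len(h))]
        for d, row in zip(d_logits, head):      # update only this task's head
            for j in range(len(h)):
                row[j] -= lr * d * h[j]
        for j, row in enumerate(self.shared):   # update the shared hidden layer
            g = d_hidden[j] * (1.0 - h[j] ** 2)  # derivative of tanh
            for i in range(len(x)):
                row[i] -= lr * g * x[i]
        return -math.log(probs[label])


# Hypothetical usage: alternate single examples from each task.
tasks = {"written": 2, "spoken": 4, "oer": 3}
model = HardSharingMTL(n_features=4, n_hidden=3, classes_per_task=tasks)
x = [0.5, -1.0, 0.25, 2.0]
for task in tasks:
    model.sgd_step(task, x, label=0)
    print(task, model.predict_proba(task, x))
```

In this sketch, a training step on any one task updates the shared hidden layer and that task's own head, while the other heads stay untouched. This is how auxiliary tasks can influence the representation used by the main task.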




Data availability
The datasets are available in the project's GitHub repository: github.com/gazzola/MTC-DTG.
Code availability
The trained models are available in the project's GitHub repository: github.com/gazzola/MTC-DTG.
Notes
SimpleLogistic model of the Weka tool.
Several texts were grouped to obtain a reasonable size.
NILC-Metrix version 2021, the current version of the tool, has 200 metrics and is available at http://fw.nilc.icmc.usp.br:23380/nilcmetrix.
Scores are based on a 7-item Likert scale, and the metrics assess four score ranges on the scale.
Acknowledgements
The authors thank the following agencies for their financial support of the Adole-sendo project: FAPESP (grant #2016/14750-0 to SP and grant 2020/01091-3 to BP), CAPES (finance code 001; postdoctoral grant #88887.357997/2019-00 to FTR; and PhD grant PROEX-8436630/D to MG), CNPq (Research Productivity Grant #301899/2019-3 to SP), and AFIP.
Funding
This research was supported by three Brazilian funding agencies: the São Paulo Research Foundation (FAPESP), the National Council for Scientific and Technological Development (CNPq), and the Coordination for the Improvement of Higher Education Personnel (CAPES).
Author information
Contributions
SA, MG and SL conceived the presented idea. SP and MG gathered the data for analysis. BP, FTR, MG and SA manually annotated the transcribed narratives. SL carried out the automatic analysis of the language samples. MG and SL implemented the multi-task learning methods. All authors contributed to the final manuscript.
Ethics declarations
Conflict of interest
The authors have no conflicts of interest to declare.
About this article
Cite this article
Gazzola, M., Leal, S., Pedroni, B. et al. Text complexity of open educational resources in Portuguese: mixing written and spoken registers in a multi-task approach. Lang Resources & Evaluation 56, 621–650 (2022). https://doi.org/10.1007/s10579-021-09571-3
DOI: https://doi.org/10.1007/s10579-021-09571-3