A Data-Driven Methodology to Assess Text Complexity Based on Syntactic and Semantic Measurements

  • Conference paper
Human Interaction and Emerging Technologies (IHIET 2019)

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1018)

Abstract

In this paper we propose a data-driven methodology to assess the text complexity of Spanish school texts. We model the problem as a classification task that can be solved with machine learning techniques. We show empirically that the discriminative power of the classifier depends on school grade level. Our proposal includes multiple predictors that capture different dimensions of text complexity, such as coherence and cohesion. We provide an importance analysis of predictors across several complexity levels. Finally, we assess model performance using accuracy and correlation measurements. The proposed model achieves accuracies of 0.7.
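
As a rough illustration of the two evaluation measures named in the abstract, the sketch below computes accuracy and a Pearson correlation between labeled and predicted complexity levels. It is not the authors' code: the function names, the choice of Pearson as the correlation measure, and the integer level labels are all assumptions made for this example, and the data is invented rather than taken from the paper.

```python
from math import sqrt


def accuracy(y_true, y_pred):
    """Fraction of texts whose predicted level equals the labeled level."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


# Hypothetical complexity levels for ten texts (invented, not from the paper).
y_true = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
y_pred = [1, 2, 2, 2, 3, 4, 4, 4, 5, 4]

print(f"accuracy = {accuracy(y_true, y_pred):.2f}")  # prints "accuracy = 0.70"
print(f"pearson  = {pearson(y_true, y_pred):.2f}")
```

Reporting both measures is useful when the classes are ordered, as grade levels are: accuracy counts only exact matches, whereas a correlation also rewards predictions that land on a neighboring level.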

Acknowledgments

This research was supported by FONDEF (Chile) under Grant IT17I0051 “Desarrollo de una herramienta computacional para la evaluación automática de textos en el sistema escolar chileno.” (“Development of a computational tool for automatic assessment of Chilean school texts”).

Author information

Corresponding author

Correspondence to Diego Palma.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Palma, D., Soto, C., Veliz, M., Riffo, B., Gutiérrez, A. (2020). A Data-Driven Methodology to Assess Text Complexity Based on Syntactic and Semantic Measurements. In: Ahram, T., Taiar, R., Colson, S., Choplin, A. (eds) Human Interaction and Emerging Technologies. IHIET 2019. Advances in Intelligent Systems and Computing, vol 1018. Springer, Cham. https://doi.org/10.1007/978-3-030-25629-6_79

  • DOI: https://doi.org/10.1007/978-3-030-25629-6_79

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-25628-9

  • Online ISBN: 978-3-030-25629-6

  • eBook Packages: Engineering, Engineering (R0)
