Readability Classification with Wikipedia Data and All-MiniLM Embeddings

  • Conference paper
Artificial Intelligence Applications and Innovations. AIAI 2023 IFIP WG 12.5 International Workshops (AIAI 2023)

Abstract

Evaluating the readability of text is a critical step in several applications, ranging from text simplification and language learning to providing school children with appropriate reading material and conveying important medical information in an easily understandable way. Much research has been dedicated to evaluating readability on larger bodies of text, such as articles and paragraphs, but its application to single sentences has received less attention. In this paper, we explore several machine learning techniques (logistic regression, random forest, Naive Bayes, KNN, MLP, and XGBoost) on a corpus of sentences from the English and Simple English Wikipedia. We build and compare a series of binary readability classifiers using extracted features as well as embeddings generated with all-MiniLM-L6-v2, and evaluate them with standard classification metrics. To the authors’ knowledge, this is the first time this sentence transformer has been used for readability assessment. Overall, we found that the MLP models, with and without embeddings, as well as the Random Forest, outperformed the other machine learning algorithms.
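The setup described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the three surface features, the toy sentences, and the hand-written logistic regression are all assumptions chosen to keep the example dependency-free. The `embed` helper shows how all-MiniLM-L6-v2 vectors could be swapped in for the hand-crafted features; it assumes the `sentence-transformers` package is installed and is not called here.

```python
import math
import re


def surface_features(sentence):
    """Hypothetical hand-crafted features (not the paper's feature set)."""
    words = re.findall(r"[A-Za-z']+", sentence)
    n = max(len(words), 1)
    mean_len = sum(len(w) for w in words) / n
    long_ratio = sum(len(w) > 6 for w in words) / n
    # Crude scaling so all features are of similar magnitude.
    return [len(words) / 10.0, mean_len / 10.0, long_ratio]


def embed(sentences):
    """Optional alternative input: one 384-dimensional vector per sentence.
    Assumes the sentence-transformers package; downloads the model on first use."""
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    return model.encode(sentences).tolist()


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(X, y, lr=0.5, epochs=2000):
    """Binary logistic regression fitted with plain batch gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                gw[j] += err * xj
            gb += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, gw)]
        b -= lr * gb / len(X)
    return w, b


def predict(w, b, x):
    """Return 1 for 'simple' and 0 for 'standard'."""
    return int(sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5)


# Invented toy data: label 1 = simple sentence, 0 = standard sentence.
simple = [
    "The cat sat on the mat.",
    "Dogs like to play in the park.",
    "Water is wet and cold.",
]
standard = [
    "The feline positioned itself comfortably upon the decorative woven floor covering.",
    "Canines habitually demonstrate considerable enthusiasm for recreational activities within municipal parks.",
    "Precipitation accumulates substantially throughout mountainous regions during winter.",
]
X = [surface_features(s) for s in simple + standard]
y = [1] * len(simple) + [0] * len(standard)
w, b = train_logreg(X, y)

for s in ["I like red apples.",
          "Photosynthesis transforms electromagnetic radiation into biochemical energy."]:
    print(predict(w, b, surface_features(s)), s)
```

To try the embedding-based variant, replace `surface_features(s)` with the rows returned by `embed(...)`; a library classifier such as an MLP or random forest would be the natural replacement for the hand-written logistic regression at realistic corpus sizes.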

Notes

  1. https://enirisst-plus.gr/

Acknowledgments

This research was co-financed by the European Union and Greek national funds through the “Competitiveness, Entrepreneurship and Innovation” Operational Programme 2014–2020, under the Call “Support for regional excellence”; project title: “Intelligent Research Infrastructure for Shipping, Transport and Supply Chain - ENIRISST+”; MIS code: 5047041.

Author information

Corresponding author

Correspondence to Katia Lida Kermanidis.

Copyright information

© 2023 IFIP International Federation for Information Processing

About this paper

Cite this paper

Vergou, E., Pagouni, I., Nanos, M., Kermanidis, K.L. (2023). Readability Classification with Wikipedia Data and All-MiniLM Embeddings. In: Maglogiannis, I., Iliadis, L., Papaleonidas, A., Chochliouros, I. (eds) Artificial Intelligence Applications and Innovations. AIAI 2023 IFIP WG 12.5 International Workshops. AIAI 2023. IFIP Advances in Information and Communication Technology, vol 677. Springer, Cham. https://doi.org/10.1007/978-3-031-34171-7_30

  • DOI: https://doi.org/10.1007/978-3-031-34171-7_30

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-34170-0

  • Online ISBN: 978-3-031-34171-7

  • eBook Packages: Computer Science, Computer Science (R0)
