Readability Classification with Wikipedia Data and All-MiniLM Embeddings

Vergou, Elena; Pagouni, Ioanna; Nanos, Marios; Kermanidis, Katia Lida

doi:10.1007/978-3-031-34171-7_30

Part of the book series: IFIP Advances in Information and Communication Technology (IFIPAICT, volume 677)

Included in the following conference series:

IFIP International Conference on Artificial Intelligence Applications and Innovations

Abstract

Evaluating the readability of text is a critical step in several applications, ranging from text simplification, language learning, and providing school children with appropriate reading material to conveying important medical information in an easily understandable way. Much research has been dedicated to evaluating the readability of larger bodies of text, such as articles and paragraphs, but its application to single sentences has received less attention. In this paper, we explore several machine learning techniques (logistic regression, random forest, Naive Bayes, KNN, MLP, XGBoost) on a corpus of sentences from the English and Simple English Wikipedia. We build and compare a series of binary readability classifiers using extracted features as well as generated all-MiniLM-L6-v2-based embeddings, and evaluate them with standard classification evaluation metrics. To the authors’ knowledge, this is the first time this sentence transformer has been used for the task of readability assessment. Overall, we found that the MLP models, with and without embeddings, as well as the random forest, outperformed the other machine learning algorithms.
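
The abstract does not list which evaluation metrics were used. As a minimal sketch, assuming they include accuracy, precision, recall and F1 for the binary task, the computation could look like the following. The function name `binary_metrics`, the label convention (1 for a Simple English Wikipedia sentence, 0 for an English Wikipedia sentence) and the toy labels are illustrative assumptions, not taken from the paper.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for labels in {0, 1}; 1 is the positive class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one example is required")

    # Count the four cells of the confusion matrix.
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Fall back to 0.0 when a denominator is zero (no positive predictions or labels).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels: 1 = Simple English Wikipedia, 0 = English Wikipedia.
gold = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]
scores = binary_metrics(gold, predicted)
print(scores)
```

The same function can be applied to the predictions of each classifier on a shared held-out set, so that the feature-based and embedding-based models are compared on identical terms.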

Notes

1. https://enirisst-plus.gr/

Acknowledgments

This research was co-financed by the European Union and Greek national funds through the “Competitiveness, Entrepreneurship and Innovation” Operational Programme 2014–2020, under the Call “Support for regional excellence”; project title: “Intelligent Research Infrastructure for Shipping, Transport and Supply Chain - ENIRISST+”; MIS code: 5047041.

Author information

Authors and Affiliations

Department of Informatics, Ionian University, Corfu, Greece
Elena Vergou, Ioanna Pagouni, Marios Nanos & Katia Lida Kermanidis

Corresponding author

Correspondence to Katia Lida Kermanidis.

Editor information

Editors and Affiliations

University of Piraeus, Piraeus, Greece
Ilias Maglogiannis
Democritus University of Thrace, Xanthi, Greece
Lazaros Iliadis
Democritus University of Thrace, Xanthi, Greece
Antonios Papaleonidas
Hellenic Telecom Organization OTE, Athens, Greece
Ioannis Chochliouros

About this paper

Cite this paper

Vergou, E., Pagouni, I., Nanos, M., Kermanidis, K.L. (2023). Readability Classification with Wikipedia Data and All-MiniLM Embeddings. In: Maglogiannis, I., Iliadis, L., Papaleonidas, A., Chochliouros, I. (eds) Artificial Intelligence Applications and Innovations. AIAI 2023 IFIP WG 12.5 International Workshops. AIAI 2023. IFIP Advances in Information and Communication Technology, vol 677. Springer, Cham. https://doi.org/10.1007/978-3-031-34171-7_30

DOI: https://doi.org/10.1007/978-3-031-34171-7_30
Published: 02 June 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-34170-0
Online ISBN: 978-3-031-34171-7
eBook Packages: Computer Science, Computer Science (R0)
