Abstract
Evaluating the readability of text is a critical step in several applications, including text simplification, language learning, providing school children with appropriate reading material, and conveying important medical information in an easily understandable way. Much research has been dedicated to evaluating the readability of larger bodies of text, such as articles and paragraphs, but its application to single sentences has received less attention. In this paper, we explore several machine learning techniques (logistic regression, random forest, Naive Bayes, KNN, MLP, and XGBoost) on a corpus of sentences from the English and Simple English Wikipedia. We build and compare a series of binary readability classifiers using extracted features as well as generated all-MiniLM-L6-v2-based embeddings, and evaluate them using standard classification metrics. To the authors’ knowledge, this is the first time this sentence transformer has been used for the task of readability assessment. Overall, we found that the MLP models, with and without embeddings, as well as the random forest, outperformed the other machine learning algorithms.
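The abstract does not list the extracted features, the hyperparameters, or the data split, so the following is only a minimal sketch of the general workflow it describes: turn each sentence into a feature vector, fit a binary classifier, and score it with standard classification metrics. The two surface features, the toy sentences and labels, and all function names are assumptions made for illustration and are not taken from the paper. To stay dependency-free, the sketch uses a hand-written logistic regression and omits the embedding step; with the sentence-transformers library, the embeddings would typically come from `SentenceTransformer("all-MiniLM-L6-v2").encode(sentences)` and could replace or extend the feature vectors.

```python
import math

def extract_features(sentence):
    """Two illustrative surface features: word count and mean word length."""
    words = [w.strip(".,;:!?\"'()") for w in sentence.split()]
    words = [w for w in words if w]
    if not words:
        return [0.0, 0.0]
    return [float(len(words)), sum(len(w) for w in words) / len(words)]

def standardize(rows):
    """Scale each feature column to zero mean and unit variance."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) or 1.0
            for j in range(d)]
    return [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(X, y, lr=0.5, epochs=300):
    """Batch gradient descent on the log loss; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(w, b, x):
    """Return 1 if the predicted probability is at least 0.5, else 0."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}

# Invented toy sentences (hypothetical data); label 1 = "complex", 0 = "simple".
sentences = [
    "The cat sat on the mat.",
    "Dogs like to play.",
    "Water is wet.",
    "Photosynthesis converts electromagnetic radiation into chemical energy "
    "stored within carbohydrate molecules.",
    "The constitutional amendment necessitated extensive parliamentary "
    "deliberation before ratification.",
    "Thermodynamic equilibrium characterises systems exhibiting no "
    "macroscopic flows of matter or energy.",
]
labels = [0, 0, 0, 1, 1, 1]

X = standardize([extract_features(s) for s in sentences])
w, b = train_logistic_regression(X, labels)
# Scoring on the training sentences only demonstrates the flow;
# a real evaluation would use a held-out test split.
preds = [predict(w, b, x) for x in X]
print(binary_metrics(labels, preds))
```

Any of the other classifiers named in the abstract could be swapped in for the logistic regression without changing the feature extraction or the metric computation.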
Acknowledgments
This research was co-financed by the European Union and Greek national funds through the “Competitiveness, Entrepreneurship and Innovation” Operational Programme 2014–2020, under the Call “Support for regional excellence”; project title: “Intelligent Research Infrastructure for Shipping, Transport and Supply Chain - ENIRISST+”; MIS code: 5047041.
Copyright information
© 2023 IFIP International Federation for Information Processing
About this paper
Cite this paper
Vergou, E., Pagouni, I., Nanos, M., Kermanidis, K.L. (2023). Readability Classification with Wikipedia Data and All-MiniLM Embeddings. In: Maglogiannis, I., Iliadis, L., Papaleonidas, A., Chochliouros, I. (eds) Artificial Intelligence Applications and Innovations. AIAI 2023 IFIP WG 12.5 International Workshops. AIAI 2023. IFIP Advances in Information and Communication Technology, vol 677. Springer, Cham. https://doi.org/10.1007/978-3-031-34171-7_30
DOI: https://doi.org/10.1007/978-3-031-34171-7_30
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-34170-0
Online ISBN: 978-3-031-34171-7
eBook Packages: Computer Science, Computer Science (R0)