Abstract
Text complexity metrics play a crucial role in quantifying the readability of important documents, with applications ranging from public safety to education. Pointwise mutual information (PMI) has been widely used to measure text complexity by capturing statistical co-occurrence patterns between word pairs, on the assumption that co-occurrence reflects semantic relatedness. We observe that word embeddings are likewise derived from co-occurrence statistics in large corpora, yet they offer faster computation and more generalizable measures of semantic proximity. Building on this observation, we propose a novel text complexity metric that leverages word embeddings to measure the semantic distance between words in a document. We empirically validate our approach on the OneStopEnglish dataset, a collection of news articles annotated with expert-labeled readability levels. Our experiments show that the proposed embedding-based metric correlates more strongly with ground-truth readability levels than conventional PMI-based metrics. This study serves as a foundation for future work incorporating context-dependent embeddings and extending applicability to other text types.
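As a minimal sketch of the idea described above, the snippet below scores a document by the average pairwise cosine similarity of its word vectors, so that lexically tighter (and typically easier) texts receive higher scores. The use of pretrained GloVe vectors loaded through gensim and the plain averaging scheme are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch: embedding-based text complexity as average pairwise cosine similarity.
# Assumptions (not from the paper): gensim's pretrained GloVe vectors and a
# simple mean over all word pairs in the document.
import itertools

import numpy as np
import gensim.downloader as api

# Any static word-vector model could be substituted here.
vectors = api.load("glove-wiki-gigaword-100")

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def embedding_tightness(tokens):
    """Average pairwise cosine similarity over in-vocabulary tokens."""
    vecs = [vectors[t] for t in tokens if t in vectors]
    if len(vecs) < 2:
        return 0.0
    sims = [cosine(u, v) for u, v in itertools.combinations(vecs, 2)]
    return sum(sims) / len(sims)

# Example usage: a lexically tighter sentence should score higher than a diffuse one.
easy = "the cat sat on the mat next to the dog".split()
hard = "quantum entanglement complicates macroeconomic arbitration".split()
print(embedding_tightness(easy), embedding_tightness(hard))
```

Replacing the embedding lookup with PMI values estimated from a reference corpus would yield the conventional baseline the paper compares against; the embedding variant avoids that corpus-specific estimation step.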
Acknowledgments
This work was supported by RE-252382-OLS-22 from the Institute of Museum and Library Services.