Abstract
Text complexity metrics play a crucial role in quantifying the readability of important documents, with applications ranging from public safety to education. Pointwise mutual information (PMI) has been widely used to measure text complexity by capturing statistical co-occurrence patterns between word pairs, on the assumption that co-occurrence reflects semantic relatedness. We observe that word embeddings are likewise derived from co-occurrence statistics in large corpora, yet they offer faster computation and more generalizable measures of semantic proximity. Building on this observation, we propose a novel text complexity metric that leverages word embeddings to measure the semantic distance between words in a document. We empirically validate our approach on the OneStopEnglish dataset, a collection of news articles annotated with expert-labeled readability levels. Our experiments show that the proposed embedding-based metric correlates more strongly with ground-truth readability levels than conventional PMI-based metrics. This study serves as a foundation for future work incorporating context-dependent embeddings and extending applicability to other text types.
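As a minimal sketch of the idea described above, the snippet below scores a document by the average pairwise cosine similarity of its word vectors, so that lexically tighter (and typically easier) texts receive higher scores. The use of pretrained GloVe vectors loaded through gensim and the plain averaging scheme are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch: embedding-based text complexity as average pairwise cosine similarity.
# Assumptions (not from the paper): gensim's pretrained GloVe vectors and a
# simple mean over all word pairs in the document.
import itertools

import numpy as np
import gensim.downloader as api

# Any static word-vector model could be substituted here.
vectors = api.load("glove-wiki-gigaword-100")

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def embedding_tightness(tokens):
    """Average pairwise cosine similarity over in-vocabulary tokens."""
    vecs = [vectors[t] for t in tokens if t in vectors]
    if len(vecs) < 2:
        return 0.0
    sims = [cosine(u, v) for u, v in itertools.combinations(vecs, 2)]
    return sum(sims) / len(sims)

# Example usage: a lexically tighter sentence should score higher than a diffuse one.
easy = "the cat sat on the mat next to the dog".split()
hard = "quantum entanglement complicates macroeconomic arbitration".split()
print(embedding_tightness(easy), embedding_tightness(hard))
```

Replacing the embedding lookup with PMI values estimated from a reference corpus would yield the conventional baseline the paper compares against; the embedding variant avoids that corpus-specific estimation step.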
Acknowledgments
This work was supported by RE-252382-OLS-22 from the Institute of Museum and Library Services.