Word Embedding-Based Text Complexity Analysis

  • Conference paper
  • In: Wisdom, Well-Being, Win-Win (iConference 2024)

Part of the book series: Lecture Notes in Computer Science (volume 14598)


Abstract

Text complexity metrics play a crucial role in quantifying the readability of important documents, helping to ensure public safety, improve educational outcomes, and more. Pointwise mutual information (PMI) has been widely used to measure text complexity by capturing statistical co-occurrence patterns between word pairs, on the assumption that such patterns reflect semantic relatedness. We observe that word embeddings resemble PMI in that both are derived from co-occurrence statistics over large corpora, yet embeddings offer faster computation and more generalizable measures of semantic proximity. Given this, we propose a novel text complexity metric that leverages word embeddings to measure the semantic distance between words in a document. We empirically validate our approach on the OneStopEnglish dataset, which contains news articles annotated with expert-labeled readability scores. Our experiments show that the proposed word embedding-based metric correlates more strongly with ground-truth readability levels than conventional PMI-based metrics. This study serves as a foundation for future research that incorporates context-dependent embeddings and extends applicability to a wider range of text types.
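The abstract does not fully specify how the metric is computed, but one plausible reading is that document complexity is scored by the average pairwise cosine similarity between the embeddings of its words (a semantically "tight" document scoring higher, i.e. reading easier). The sketch below illustrates that idea under stated assumptions: the tiny `EMBEDDINGS` dictionary and the word lists are hypothetical toy data, not the paper's vectors; a real analysis would load pretrained vectors such as word2vec or GloVe (e.g. via gensim's downloader, as the paper's footnote suggests).

```python
import numpy as np

# Hypothetical toy embeddings for illustration only; real experiments
# would use pretrained vectors (word2vec, GloVe, fastText, ...).
EMBEDDINGS = {
    "cat":     np.array([0.9, 0.1, 0.0]),
    "dog":     np.array([0.8, 0.2, 0.1]),
    "purrs":   np.array([0.7, 0.3, 0.0]),
    "quantum": np.array([0.0, 0.2, 0.9]),
    "flux":    np.array([0.1, 0.1, 0.8]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def avg_pairwise_similarity(words):
    """Mean cosine similarity over all word pairs in a document.

    Under the hypothesis sketched here, higher values indicate a
    semantically tighter (easier) text; lower values indicate that the
    document's vocabulary is more scattered, i.e. more complex.
    """
    vecs = [EMBEDDINGS[w] for w in words if w in EMBEDDINGS]
    sims = [cosine(vecs[i], vecs[j])
            for i in range(len(vecs)) for j in range(i + 1, len(vecs))]
    return sum(sims) / len(sims) if sims else 0.0

easy = ["cat", "dog", "purrs"]      # thematically coherent vocabulary
hard = ["cat", "quantum", "flux"]   # semantically scattered vocabulary
assert avg_pairwise_similarity(easy) > avg_pairwise_similarity(hard)
```

A PMI-based metric would instead estimate pair scores from corpus co-occurrence counts; the embedding version avoids that counting step and generalizes to word pairs never observed together.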


Notes

  1. https://github.com/RaRe-Technologies/gensim-data#models.



Acknowledgments

This work was supported by RE-252382-OLS-22 from the Institute of Museum and Library Services.

Author information

Corresponding author: Kahyun Choi


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Choi, K. (2024). Word Embedding-Based Text Complexity Analysis. In: Sserwanga, I., et al. Wisdom, Well-Being, Win-Win. iConference 2024. Lecture Notes in Computer Science, vol 14598. Springer, Cham. https://doi.org/10.1007/978-3-031-57867-0_21

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-57867-0_21

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-57866-3

  • Online ISBN: 978-3-031-57867-0

  • eBook Packages: Computer Science (R0)
