Identifying the driving factors of word co-occurrence: a perspective of semantic relations

Scientometrics

Abstract

This study investigates the driving factors of word co-occurrence from the perspective of semantic relations between frequently co-occurring words. Natural sentences in a corpus of news articles served as co-occurrence windows for extracting co-occurring word pairs, with no limit on the distance between the two words in a pair. ConceptNet, a semantic knowledge base, was used to annotate the semantic relation between co-occurring words. Because some co-occurring word pairs have no direct semantic relation in ConceptNet, we proposed a relation annotation method that connects the two words through an intermediate word. Results showed that six semantic relations in ConceptNet (RelatedTo, IsA, Synonym, HasContext, Antonym, and MannerOf) were important factors directly inducing word co-occurrence, and that combinations of some of those relations were important factors indirectly driving it. Syntactic analysis and lexical semantic theories were then combined to analyze the direct and indirect semantic relations. This analysis showed that the factors driving word co-occurrence in sentences fall into three relation categories: collocation and modification, hyponymy, and synonym and antonym. These findings can help explain the phenomenon of word co-occurrence and improve the methods and applications of co-word analysis.
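The two steps described in the abstract, sentence-window pair extraction and relation annotation through an intermediate word, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the regex-based sentence splitting, and the `TOY_EDGES` dictionary standing in for ConceptNet are all assumptions made for illustration, and the sketch omits any stop-word removal or frequency threshold the study may have applied.

```python
import re
from collections import Counter
from itertools import combinations

# Toy stand-in for ConceptNet edges: (word_a, word_b) -> relation label.
# These entries are illustrative assumptions, not data from the study.
TOY_EDGES = {
    ("dog", "animal"): "IsA",
    ("cat", "animal"): "IsA",
    ("hot", "cold"): "Antonym",
}


def sentence_pairs(text):
    """Count unordered word pairs that share a sentence, at any distance."""
    counts = Counter()
    for sentence in re.split(r"[.!?]+", text.lower()):
        # Each distinct word counts once per sentence; sorting makes
        # every pair appear in a single canonical order.
        words = sorted(set(re.findall(r"[a-z]+", sentence)))
        counts.update(combinations(words, 2))
    return counts


def direct_relation(a, b, edges=TOY_EDGES):
    """Return the relation linking a and b in either direction, or None."""
    return edges.get((a, b)) or edges.get((b, a))


def indirect_relation(a, b, edges=TOY_EDGES):
    """Find an intermediate word directly related to both a and b.

    Returns (relation_a_to_m, m, relation_m_to_b) or None.
    """
    vocabulary = {word for pair in edges for word in pair}
    for m in sorted(vocabulary - {a, b}):
        first = direct_relation(a, m, edges)
        second = direct_relation(m, b, edges)
        if first and second:
            return (first, m, second)
    return None


def annotate(a, b, edges=TOY_EDGES):
    """Prefer a direct relation; fall back to an intermediate-word path."""
    direct = direct_relation(a, b, edges)
    if direct:
        return ("direct", direct)
    indirect = indirect_relation(a, b, edges)
    if indirect:
        return ("indirect", indirect)
    return ("unmatched", None)


if __name__ == "__main__":
    corpus = "The dog chased the cat. A dog is an animal. Hot and cold."
    pairs = sentence_pairs(corpus)
    for (a, b), n in pairs.items():
        label = annotate(a, b)
        if label[0] != "unmatched":
            print(a, b, n, label)
```

In this toy graph, "dog" and "cat" have no direct edge, so the pair is annotated through the intermediate word "animal" as a combination of two IsA relations, which mirrors the indirect case the abstract describes. A real pipeline would replace `TOY_EDGES` with lookups against ConceptNet itself.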

Acknowledgments

This research is funded by the National Key Research and Development Program of China (2019YFA0707201), National Natural Science Foundation of China (72274146, 71874130 & 71921002), and the Ministry of Education of the People’s Republic of China (22JJD870004 & 18YJC870026).

Author information

Corresponding author

Correspondence to Yiming Zhao.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhao, Y., Yin, J., Zhang, J. et al. Identifying the driving factors of word co-occurrence: a perspective of semantic relations. Scientometrics 128, 6471–6494 (2023). https://doi.org/10.1007/s11192-023-04851-x

  • DOI: https://doi.org/10.1007/s11192-023-04851-x
