RefCit2vec: embedding models considering references and citations for measuring document similarity

Scientometrics

Abstract

This study outlines the intellectual structure of Library and Information Science (LIS) at the venue level using RefCit2vec, an embedding method inspired by word2vec. The reference lists or cited-by lists of 62,077 articles published in 35 venues (journals and proceedings) between 1928 and 2022 are converted into real-valued vectors by four independent RefCit2vec models. The document similarities measured by two of the models correlate moderately with bibliographic coupling metrics, whereas the similarities from the other two models correlate moderately or strongly with co-citation metrics. Each venue is represented by its centroid, the average vector of its constituent documents. When hierarchical agglomerative clustering is applied to the venue centroids, 69% of venues robustly emerge in 6 out of 8 clusters. Four clusters consistently form the library-related branch. The bibliometrics/scientometrics branch contains only 1 cluster, whereas the information-related branch contains 3 clusters. 43% of venues fall into six subgroups with consistent tree structures. An article is defined as SCIM-alike if it is closer to the SCIM centroid than half of SCIM articles are. 10% of JASIST articles are SCIM-alike based on their reference lists, and 5% are SCIM-alike based on their cited-by lists. The percentage of SCIM-alike articles in JASIST rose above the average between 2008 and 2018 but has dropped below it since 2019. Beyond demonstrating these dynamics in LIS, citation embedding methods like RefCit2vec can incorporate citation-based, text-based, or authorship features to support varied scenarios in investigating or exploring research fronts and scientific knowledge transfer.
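The centroid and "SCIM-alike" definitions in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, the toy vectors are invented for illustration and are not data from the study, and cosine distance is an assumption, since the abstract does not state which distance measure is used.

```python
import math
from statistics import median


def centroid(vectors):
    """Average vector of a venue's document embeddings."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(u, v):
    """1 - cosine similarity (assumed metric; not stated in the abstract)."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms


def alike_share(candidates, reference):
    """Share of candidate articles that are closer to the reference venue's
    centroid than half of the reference venue's own articles are, i.e.
    closer than the median distance of reference articles to that centroid."""
    center = centroid(reference)
    threshold = median(cosine_distance(v, center) for v in reference)
    hits = sum(1 for v in candidates if cosine_distance(v, center) < threshold)
    return hits / len(candidates)


# Toy 2-dimensional embeddings (hypothetical, for illustration only).
reference_venue = [[1.0, 0.0], [1.0, 0.2], [1.0, -0.2], [1.0, 1.0]]
candidate_venue = [[1.0, 0.25], [0.0, 1.0], [1.0, 0.3], [-1.0, 0.0]]

share = alike_share(candidate_venue, reference_venue)
print(f"Share of candidate articles that are reference-alike: {share:.2f}")
```

In this reading, the median distance of a venue's own articles to its centroid acts as the threshold, so exactly half of the venue's articles (in the absence of ties) lie inside it by construction, and articles from another venue are compared against that same threshold.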

Data availability

Data were retrieved via the Elsevier Research Products APIs provided by Elsevier B.V.

Funding

This work was partially supported by the National Science and Technology Council of the Republic of China (Grant No. MOST 110-2410-H-002-232-MY2).

Author information

Contributions

All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by Kuang-hua Chen and Chien-chih Huang. The first draft of the manuscript was written by Chien-chih Huang and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Kuang-hua Chen.

Ethics declarations

Conflict of interest

The authors have no conflicts of interest.

About this article

Cite this article

Huang, Cc., Chen, Kh. RefCit2vec: embedding models considering references and citations for measuring document similarity. Scientometrics 129, 4669–4693 (2024). https://doi.org/10.1007/s11192-024-05067-3
