RefCit2vec: embedding models considering references and citations for measuring document similarity

Huang, Chien-chih; Chen, Kuang-hua

doi:10.1007/s11192-024-05067-3

Published: 10 July 2024

Scientometrics, Volume 129, pages 4669–4693 (2024)

Abstract

This study outlines the intellectual structure of Library and Information Science (LIS) in terms of its venues using RefCit2vec, an embedding method inspired by word2vec. The reference lists or cited-by lists of 62,077 articles in 35 venues (journals and proceedings) published between 1928 and 2022 are converted into real-valued vectors by four independent RefCit2vec models. The document similarities measured by two of the models correlate moderately with bibliographic coupling metrics, whereas the similarities from the other two models correlate moderately or strongly with co-citation metrics. Each venue is represented by its centroid, the average vector of its constituent documents. When hierarchical agglomerative clustering is applied to the venue centroids, 69% of venues robustly emerge in 6 out of 8 clusters. Four clusters consistently form the library-related branch. The bibliometrics/scientometrics branch contains only 1 cluster, whereas the information-related branch contains 3 clusters. 43% of venues fall into six subgroups with consistent tree structures. An article is defined as SCIM-alike if it is closer to the SCIM centroid than half of SCIM articles are. 10% of JASIST articles are SCIM-alike based on their reference lists, and 5% are SCIM-alike based on their cited-by lists. The percentage of SCIM-alike articles in JASIST rose above the average between 2008 and 2018 but has dropped below the average since 2019. Beyond demonstrating these dynamics in LIS, citation embedding methods like RefCit2vec can incorporate citation-based, text-based, or authorship features to support varied scenarios in investigating or exploring research fronts and scientific knowledge transfer.
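The venue centroid and the "alike" criterion described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state which distance measure is used, so cosine distance is assumed here, and the function names, the median-based reading of "closer than half of the venue's articles", and the toy two-dimensional vectors are all hypothetical.

```python
import math
from statistics import median


def centroid(vectors):
    """Average vector of a venue's document embeddings."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(a, b):
    """1 - cosine similarity (assumed measure; not confirmed by the abstract)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def is_venue_alike(article, venue_vectors):
    """True if the article is closer to the venue centroid than half of
    the venue's own articles, i.e. below their median distance."""
    c = centroid(venue_vectors)
    venue_distances = [cosine_distance(v, c) for v in venue_vectors]
    return cosine_distance(article, c) < median(venue_distances)


# Toy embeddings standing in for one venue's documents (made-up values).
toy_venue = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.6], [0.6, 0.8]]

print(is_venue_alike([2.0, 1.0], toy_venue))   # points roughly along the centroid
print(is_venue_alike([-1.0, 0.0], toy_venue))  # points away from the centroid
```

Under this reading, the share of articles from one venue that are alike to another venue is the fraction of that venue's vectors for which `is_venue_alike` returns `True`.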

Data availability

Data are retrieved via Elsevier Research Products APIs by Elsevier B.V.

Funding

This work was partially supported by the National Science and Technology Council of the Republic of China (Grant No. MOST 110-2410-H-002-232-MY2).

Author information

Authors and Affiliations

Department of Library and Information Science, National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei, 10617, Taiwan, ROC
Chien-chih Huang & Kuang-hua Chen

Contributions

All authors contributed to the study conception and design. Material preparation, data collection, and analysis were performed by Kuang-hua Chen and Chien-chih Huang. The first draft of the manuscript was written by Chien-chih Huang, and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Kuang-hua Chen.

Ethics declarations

Conflict of interest

The authors have no conflicts of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Huang, Cc., Chen, Kh. RefCit2vec: embedding models considering references and citations for measuring document similarity. Scientometrics 129, 4669–4693 (2024). https://doi.org/10.1007/s11192-024-05067-3

Received: 30 April 2023
Accepted: 16 May 2024
Published: 10 July 2024
Issue Date: August 2024
DOI: https://doi.org/10.1007/s11192-024-05067-3
