A novel machine-learning approach to measuring scientific knowledge flows using citation context analysis

Scientometrics

Abstract

We measure the knowledge flows between countries by analysing publication and citation data, arguing that not all citations are equally important. In contrast to existing techniques that use absolute citation counts to quantify knowledge flows between entities, our model applies citation context analysis, using a machine-learning approach to distinguish important from non-important citations. We use 14 novel features (including context-based, cue-word-based and text-based features) to train a Support Vector Machine (SVM) and a Random Forest classifier on an annotated dataset of 20,527 publications downloaded from the Association for Computational Linguistics anthology (http://allenai.org/data.html). Our machine-learning models outperform existing state-of-the-art citation context approaches: under 10-fold cross-validation, the SVM model reaches up to 61% and the Random Forest model up to 90% Precision–Recall Area Under the Curve. Finally, we present a case study that applies the method to datasets of PLoS ONE full-text publications in the field of Computer and Information Sciences. Our results show that a significant volume of the knowledge flowing from the United States, based on important citations, is consumed by the international scientific community. Of the total knowledge flow from China, a relatively small proportion (only 4.11%) is based on important citations, while The Netherlands and Germany show the highest proportions, at 9.06% and 7.35% respectively. Among institutions, more than 10% of the knowledge produced at the University of Malaya falls into the important category. We believe that such analyses help to understand the dynamics of knowledge flows across nations and institutions.
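The country-level proportions reported above can be illustrated with a minimal sketch. It assumes each citation has already been labelled important or non-important by a classifier, and that a knowledge flow is counted only when the citing and cited countries differ. The function name `important_flow_share`, the record layout and the toy records are hypothetical and are not taken from the article.

```python
from collections import defaultdict


def important_flow_share(citations):
    """Per cited (source) country, return the percentage of outgoing
    cross-country citations that are labelled important."""
    totals = defaultdict(int)
    important = defaultdict(int)
    for citing, cited, is_important in citations:
        if citing == cited:
            # Assumption: domestic citations are not cross-country flows.
            continue
        totals[cited] += 1
        if is_important:
            important[cited] += 1
    return {country: 100.0 * important[country] / totals[country]
            for country in totals}


# Made-up records: (citing country, cited country, classifier label).
toy = [
    ("DE", "US", True),
    ("CN", "US", False),
    ("NL", "US", False),
    ("US", "CN", False),
    ("DE", "CN", False),
    ("US", "US", True),   # skipped: same country
    ("US", "NL", True),
    ("CN", "NL", False),
]
shares = important_flow_share(toy)
for country in sorted(shares):
    print(f"{country}: {shares[country]:.2f}% of outgoing flow is important")
```

Under these assumptions, a figure such as the 4.11% reported for China would be the value this function returns for that country when run over the classified citation records.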

References

  • Bonzi, S. (1982). Characteristics of a literature as predictors of relatedness between cited and citing works. Journal of the American Society for Information Science, 33(4), 208–216.

  • Borgman, C. L. (1990). Scholarly communication and bibliometrics. Thousand Oaks, CA: Sage.

  • Borgman, C. L., & Rice, R. E. (1992). The convergence of information science and communication: A bibliometric analysis. Journal of the American Society for Information Science, 43(6), 397.

  • Börner, K., Penumarthy, S., Meiss, M., & Ke, W. (2006). Mapping the diffusion of scholarly knowledge among major US research institutions. Scientometrics, 68(3), 415–426.

  • Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.

  • Chubin, D. E., & Moitra, S. (1975). Content analysis of references: Adjunct or alternative to citation counting? Social Studies of Science, 5(4), 423–441.

  • Garzone, M. A. (1997). Automated classification of citations using linguistic semantic grammars. Doctoral dissertation, University of Western Ontario, London, ON.

  • Guevara, M. R., Hartmann, D., Aristarán, M., Mendoza, M., & Hidalgo, C. A. (2016). The research space: Using career paths to predict the evolution of the research output of individuals, institutions, and nations. Scientometrics, 109(3), 1695–1709.

  • Hagel, J., Brown, S. J., Kulasooriya, D., & Elbert, D. (2010). Measuring the forces of long-term change: The 2010 shift index. Deloitte Center for the Edge, 2.

  • Hassan, S. U., Akram, A., Asghar, A., & Aljohani, N. F. (2017a). Measuring scientific knowledge flows by deploying citation context analysis using machine learning approach on PLoS ONE full text. In 16th international conference on scientometrics and informetrics (pp. 322–333), Wuhan, China.

  • Hassan, S. U., Akram, A., & Haddawy, P. (2017b). Identifying important citations using contextual information from full text. In Joint international conference on digital libraries, Ontario, Canada.

  • Hassan, S. U., & Haddawy, P. (2013). Measuring international knowledge flows and scholarly impact of scientific research. Scientometrics, 94(1), 163–179.

  • Hassan, S. U., & Haddawy, P. (2015a). Analyzing knowledge flows of scientific literature through semantic links: A case study in the field of energy. Scientometrics, 103(1), 33–46.

  • Hassan, S. U., & Haddawy, P. (2015b). Tapping into scientific knowledge flows via semantic links. In 15th international conference on scientometrics and informetrics, Istanbul, Turkey.

  • Hicks, D., Breitzman, T., Olivastro, D., & Hamilton, K. (2001). The changing composition of innovative activity in the US—A portrait based on patent analysis. Research Policy, 30(4), 681–703.

  • Hu, Z., Chen, C., & Liu, Z. (2013). Where are citations located in the body of scientific articles? A study of the distributions of citation locations. Journal of Informetrics, 7(4), 887–896.

  • Hu, A. G., & Jaffe, A. B. (2003). Patent citations and international knowledge flow: The cases of Korea and Taiwan. International Journal of Industrial Organization, 21(6), 849–880.

  • Ingwersen, P., Larsen, B., & Wormell, I. (2000). Applying diachronic citation analysis to ongoing research program evaluations. In B. Cronin & H. B. Atkins (Eds.), The web of knowledge: A Festschrift in honor of Eugene Garfield. Medford, NJ: Information Today Inc. & The American Society for Information Science.

  • Jaffe, A. B., Trajtenberg, M., & Henderson, R. (1993). Geographic localization of knowledge spillovers as evidenced by patent citations. Quarterly Journal of Economics, 108(3), 577–598.

  • Khasseh, A. A., Soheili, F., Moghaddam, H. S., & Chelak, A. M. (2017). Intellectual structure of knowledge in iMetrics: A co-word analysis. Information Processing and Management, 53(3), 705–720.

  • Leydesdorff, L., & Probst, C. (2009). The delineation of an interdisciplinary specialty in terms of a journal set: The case of communication studies. Journal of the American Society for Information Science and Technology, 60(8), 1709–1718.

  • Liu, S., & Chen, C. (2011). The proximity of co-citation. Scientometrics, 91(2), 495–511.

  • Liu, X., Jiang, S., Chen, H., Larson, C. A., & Roco, M. C. (2014). Nanotechnology knowledge diffusion: Measuring the impact of the research networking and a strategy for improvement. Journal of Nanoparticle Research, 16(9), 1–15.

  • Lockett, A., & McWilliams, A. (2005). The balance of trade between disciplines: Do we effectively manage knowledge? Journal of Management Inquiry, 14(2), 139–150.

  • Luo, X., Xu, Z., Li, Q., Hu, Q., Yu, J., & Tang, X. (2009). Generation of similarity knowledge flow for intelligent browsing based on semantic link networks. Concurrency and Computation: Practice and Experience, 21(16), 2018–2032.

  • Luo, X., Yu, J., Li, Q., Liu, F., & Xu, Z. (2010). Building web knowledge flows based on interactive computing with semantics. New Generation Computing, 28(2), 113–120.

  • Luukkonen, T. (1992). Is scientists’ publishing behaviour reward-seeking? Scientometrics, 24, 297–319.

  • Mete, M. V., & Deshmukh, P. P. (1996). Citation analysis of annals of library science and documentation. Annals of Library Science and Documentation, 42(3), 11–25.

  • Meyer, M. (2002). Tracing knowledge flows in innovation systems—An informetric perspective on future research science-based innovation. Economic Systems Research, 14(4), 323–344.

  • Moravcsik, M. J., & Murugesan, P. (1975). Some results on the function and quality of citations. Social Studies of Science, 5(1), 86–92.

  • Oppenheim, C., & Renn, S. P. (1978). Highly cited old papers and the reasons why they continue to be cited. Journal of the American Society for Information Science, 29(5), 227–231.

  • Patel, P. (1998). Indicators for systems of innovation and system interactions: Technological collaboration and interactive learning, IDEA report 11/1998. Oslo: STEP.

  • Ponomariov, B., & Toivanen, H. (2014). Knowledge flows and bases in emerging economy innovation systems: Brazilian research 2005–2009. Research Policy, 43(3), 588–596.

  • Ribeiro, L. C., Kruss, G., Britto, G., Bernardes, A. T., & Albuquerque, E. D. M. (2014). A methodology for unveiling global innovation networks: Patent citations as clues to cross border knowledge flows. Scientometrics, 101(1), 61–83.

  • Rosvall, M., & Bergstrom, C. T. (2008). Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences, 105(4), 1118–1123.

  • Schott, T. (1994). Collaboration in the invention of technology: Globalization, regions, and centers. Social Science Research, 23(1), 23–56.

  • Small, H. (1986). The synthesis of specialty narratives from co-citation clusters. Journal of the American Society for Information Science, 37(3), 97.

  • Stigler, S. M. (1994). Citation patterns in the journals of statistics and probability. Statistical Science, 9, 94–108.

  • Suykens, J. A., & Vandewalle, J. (1999). Least squares support vector machine classifiers. Neural Processing Letters, 9(3), 293–300.

  • Teufel, S., Siddharthan, A., & Tidhar, D. (2006). Automatic classification of citation function. In Proceedings of the 2006 conference on empirical methods in natural language processing (pp. 103–110). Association for Computational Linguistics.

  • Valenzuela, M., Ha, V., & Etzioni, O. (2015). Identifying meaningful citations. In Workshops at the twenty-ninth AAAI conference on artificial intelligence.

  • Yan, E. (2015). Research dynamics, impact, and dissemination: A topic-level analysis. Journal of the Association for Information Science and Technology, 66(11), 2357–2372.

  • Yan, E. (2016). Disciplinary knowledge production and diffusion in science. Journal of the Association for Information Science and Technology, 67(9), 2223–2245.

  • Yan, E., Ding, Y., Cronin, B., & Leydesdorff, L. (2013). A bird’s-eye view of scientific trading: Dependency relations among fields of science. Journal of Informetrics, 7(2), 249–264.

  • Yan, E., & Sugimoto, C. R. (2011). Institutional interactions: Exploring social, cognitive, and geographic relationships between institutions as demonstrated through citation networks. Journal of the American Society for Information Science and Technology, 62(8), 1498–1514.

  • Yang, S., & Wang, F. (2015). Visualizing information science: Author direct citation analysis in China and around the world. Journal of Informetrics, 9(1), 208–225.

  • Zhang, Q., Ciulla, F., Goncalves, B., Perra, N., & Vespignani, A. (2013). Characterizing production and consumption in physics. APS Meeting Abstracts, 1, 28001.

  • Zhuge, H. (2006). Discovery of knowledge flow in science. Communications of the ACM, 49(5), 101–107.

  • Zhuge, H. (2009). Communities and emerging semantics in semantic link network: Discovery and learning. IEEE Transactions on Knowledge and Data Engineering, 21(6), 785–799.

  • Zhuge, H. (2010). Interactive semantics. Artificial Intelligence, 174(2), 190–204.

  • Zhuge, H. (2011). Semantic linking through spaces for cyber-physical-socio intelligence: A methodology. Artificial Intelligence, 175(5), 988–1019.

  • Zhuge, H., Ma, J., & Shi, X. (1997). Analogy and abstract in cognitive space: A software process model. Information and Software Technology, 39, 463–468.

  • Ziman, J. M. (1968). Public knowledge: An essay concerning the social dimension of science (Vol. 519). CUP Archive.

Acknowledgements

The present study is an extended version of an article presented at the 16th International Conference on Scientometrics and Informetrics, Wuhan (China), 16–20 October 2017 (Hassan et al. 2017a).

Author information

Corresponding author

Correspondence to Saeed-Ul Hassan.

Appendix

See Tables 6, 7, 8 and 9.

Table 6 List of Cue terms used for the classification of non-important and important citations
Table 7 Sample PLoS dataset examples. The cited articles are marked as important [1] and non-important [0] class
Table 8 Distribution of references published by authors across countries, cited in papers published by authors from other countries
Table 9 Distribution of references published by authors across institutions, cited in papers published by authors from other institutions

About this article

Cite this article

Hassan, SU., Safder, I., Akram, A. et al. A novel machine-learning approach to measuring scientific knowledge flows using citation context analysis. Scientometrics 116, 973–996 (2018). https://doi.org/10.1007/s11192-018-2767-x

  • DOI: https://doi.org/10.1007/s11192-018-2767-x
