Abstract
This work surveys and expands on definitions of influence in bibliometrics. Using the union of DBLP and CiteSeerX, approximately 6 million publications, we develop a relatively small number of features to describe the data, including two novel features: loyalty and community longevity. In a series of machine learning experiments, these features successfully predict the influential set of papers. The most predictive features are highlighted and discussed.
Acknowledgments
This research was supported, in part, under National Science Foundation Grants CNS-0958379, CNS-0855217, ACI-1126113 and the City University of New York High Performance Computing Center at the College of Staten Island. The authors also acknowledge the Office of Information Technology at The Graduate Center, CUNY for providing database and server resources that have contributed to the research results reported within this paper. URL: http://it.gc.cuny.edu/.
Appendices
Appendix 1: Features
Table 2 lists all 48 features used in our system. We consider different functionals (for example, the min or max of a set of numbers) to be different features.
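As an illustration of how one set of numbers can yield several features, the following minimal sketch applies a few functionals to the same values and names each result separately. The function name, the feature name `author_pub_count`, and the sample values are hypothetical and are not taken from Table 2.

```python
from statistics import mean


def expand_functionals(name, values):
    """Turn one set of numbers into several named features, one per functional."""
    # Assumed set of functionals; the system's actual list may differ.
    functionals = {"min": min, "max": max, "mean": mean}
    return {f"{name}_{label}": fn(values) for label, fn in functionals.items()}


# Hypothetical per-author publication counts for the authors of one paper.
features = expand_functionals("author_pub_count", [3, 12, 40])
print(features)
```

Each key in the resulting dictionary would count as a distinct feature under the convention described above.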
Appendix 2: Clustering performance
Table 3 shows the different aliases for the Quantitative Evaluation of Systems (QEST) conference. We chose this conference because of its relatively small number of entries but its relatively high number of aliases. ID numbers uniquely identify an alias within our database. Lines separate clusters of aliases. Note that there is one large cluster of 15 aliases and many clusters with a single alias.
Because none of the clusters have aliases belonging to other conferences, the purity of each cluster and of the set of clusters is 1.0. The entropy of this set of clusters is 0.0269, slightly higher than that of the other communities we sampled (0.0231).
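The purity calculation can be sketched as follows, alongside a commonly used size-weighted class entropy. This is a sketch under assumptions: the section does not state the exact formulas used, and the entropy below (which is zero whenever every cluster is pure) is the standard per-cluster label entropy, so it is not necessarily the definition behind the values reported above. The cluster contents are hypothetical stand-ins, not the rows of Table 3.

```python
import math
from collections import Counter


def cluster_purity(labels):
    """Fraction of a cluster's members that carry its most common label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


def cluster_entropy(labels):
    """Shannon entropy (natural log) of the label distribution in one cluster."""
    n = len(labels)
    return sum(-(c / n) * math.log(c / n) for c in Counter(labels).values())


def weighted_average(clusters, score):
    """Size-weighted mean of a per-cluster score over all clusters."""
    total = sum(len(c) for c in clusters)
    return sum(len(c) / total * score(c) for c in clusters)


# Hypothetical clustering: each inner list holds the true conference of
# every alias placed in that cluster (one large cluster plus singletons).
pure = [["QEST"] * 15, ["QEST"], ["QEST"], ["QEST"]]
# Hypothetical clustering in which one cluster mixes two conferences.
mixed = [["QEST"] * 3 + ["OTHER"], ["OTHER"] * 4]

pure_purity = weighted_average(pure, cluster_purity)
mixed_purity = weighted_average(mixed, cluster_purity)
pure_entropy = weighted_average(pure, cluster_entropy)
mixed_entropy = weighted_average(mixed, cluster_entropy)
print(pure_purity, mixed_purity, pure_entropy, mixed_entropy)
```

When no cluster contains aliases from another conference, every per-cluster purity is 1.0 and so is the weighted total, matching the reasoning above. Mixing conferences within a cluster lowers purity and raises this entropy.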
Cite this article
Brizan, D.G., Gallagher, K., Jahangir, A. et al. Predicting citation patterns: defining and determining influence. Scientometrics 108, 183–200 (2016). https://doi.org/10.1007/s11192-016-1950-1
DOI: https://doi.org/10.1007/s11192-016-1950-1