Abstract
Bibliometric analyses depend on the quality of data sets and the author name disambiguation process (ANDP), which attributes author names on papers to real persons. Errors in a data set or in the ANDP result in papers being attributed to the wrong person. These errors can distort the results of analyses based on such data sets. However, the general impact of data set quality on bibliometric analyses is mostly unknown, because such an assessment is costly owing to the manual steps involved. This paper presents an overview of the data set qualities produced by different ANDPs and uses simulations to study the general impact of data set quality on different bibliometric analyses (author rankings and regression analyses with the number of papers as the dependent variable). The results show that rankings of authors are only valid on high-quality data sets, which are typically not found directly in commercially available data sets. Both the mean and the individual, per-person data set quality are important for valid ranking results. Regressions are influenced less by the overall data set quality than by individual quality differences between authors. Different types of errors can bias the regression results. The outcome of this study also shows the importance of reporting both the overall quality and the individual variation in data set quality, so that the validity of analyses based on these data sets can be assessed.
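The simulation idea summarized above can be illustrated with a minimal Monte Carlo sketch. It is not the procedure used in the paper; every modelling choice here is an assumption made for illustration: true paper counts follow a Pareto distribution with shape 1.5, each author's papers are retained with a per-author recall drawn around a mean `recall` with standard deviation `spread`, wrongly attributed papers are assigned to authors uniformly at random so that the overall `precision` is roughly met, and ranking validity is measured by the Spearman correlation between true and observed counts. All function and parameter names are hypothetical.

```python
import random


def average_ranks(values):
    """Rank values ascending (1 = smallest); ties share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant ranking carries no ordering information
    return cov / (var_x * var_y) ** 0.5


def simulate_once(rng, n_authors, recall, precision, spread):
    """One simulated data set; returns the true-vs-observed rank correlation."""
    # Assumption: heavy-tailed productivity, at least one paper per author.
    true_counts = [int(rng.paretovariate(1.5)) for _ in range(n_authors)]
    observed = []
    for papers in true_counts:
        # Assumption: per-author recall varies around the mean, clamped to [0, 1].
        own_recall = min(1.0, max(0.0, rng.gauss(recall, spread)))
        observed.append(sum(rng.random() < own_recall for _ in range(papers)))
    # Assumption: wrongly attributed papers land on authors uniformly at random.
    n_wrong = round(sum(observed) * (1 - precision) / precision)
    for _ in range(n_wrong):
        observed[rng.randrange(n_authors)] += 1
    return spearman(true_counts, observed)


def mean_rank_correlation(recall, precision, spread=0.0,
                          n_authors=300, runs=50, seed=1):
    """Average rank correlation over repeated runs (precision must be > 0)."""
    rng = random.Random(seed)
    results = [simulate_once(rng, n_authors, recall, precision, spread)
               for _ in range(runs)]
    return sum(results) / runs


if __name__ == "__main__":
    for recall, precision, spread in [(1.0, 1.0, 0.0), (0.9, 0.9, 0.0), (0.9, 0.9, 0.2)]:
        value = mean_rank_correlation(recall, precision, spread)
        print(f"recall={recall} precision={precision} spread={spread}: {value:.3f}")
```

Comparing a run with `spread=0.0` against one with a larger `spread` at the same mean `recall` is one way to separate the effect of overall quality from that of per-author quality differences, under the assumptions stated above.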
Acknowledgments
An earlier version of this article was presented at the 14th International Society of Scientometrics and Informetrics Conference, Vienna, Austria, 15th–19th July 2013.
Cite this article
Schulz, J. Using Monte Carlo simulations to assess the impact of author name disambiguation quality on different bibliometric analyses. Scientometrics 107, 1283–1298 (2016). https://doi.org/10.1007/s11192-016-1892-7
DOI: https://doi.org/10.1007/s11192-016-1892-7