Using Monte Carlo simulations to assess the impact of author name disambiguation quality on different bibliometric analyses

Scientometrics

Abstract

Bibliometric analyses depend on the quality of data sets and of the author name disambiguation process (ANDP), which attributes author names on papers to real persons. Errors in a data set or in the ANDP result in papers being attributed to the wrong person. These errors can potentially distort the results of analyses based on such data sets. However, the general impact of data set quality on bibliometric analyses is mostly unknown, as such an assessment is costly due to the manual steps involved. This paper presents an overview of the data set qualities produced by different ANDPs and uses simulations to study the general impact of data set quality on different bibliometric analyses (author rankings and regression analyses with the number of papers as the dependent variable). The results show that rankings of authors are only valid on high-quality data sets, which are typically not found directly in commercially available data sets. Both the mean and the individual per-person data set quality are important for valid ranking results. Regressions are influenced less by the overall data set quality than by individual quality differences between authors. Different types of errors can potentially bias the regression results. The outcome of this study also shows the importance of reporting both the overall data set quality and its individual variation, so that the validity of analyses based on these data sets can be assessed.
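The simulation idea described in the abstract can be illustrated with a minimal sketch. It is not the procedure used in the paper: the heavy-tailed productivity distribution, the error model (each true paper is retained with a per-author recall drawn from a clipped normal distribution), and all names such as `simulate`, `mean_recall`, and `recall_spread` are assumptions made for illustration only. The sketch compares the ranking by true paper counts with the ranking by observed counts using Spearman's rank correlation.

```python
import random


def rank(values):
    """Return average ranks (1 = smallest); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation, computed as Pearson on average ranks."""
    return pearson(rank(x), rank(y))


def simulate(mean_recall, recall_spread, n_authors=500, n_runs=200, seed=0):
    """Mean rank correlation between true and observed paper counts.

    Assumed error model (hypothetical): each author's true papers are kept
    independently with a per-author recall drawn from a normal distribution
    clipped to [0, 1]. Wrongly added papers are not modeled here.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_runs):
        # Heavy-tailed productivity: most authors have one paper (assumption).
        true_counts = [int(rng.paretovariate(1.5)) for _ in range(n_authors)]
        observed = []
        for count in true_counts:
            recall = min(1.0, max(0.0, rng.gauss(mean_recall, recall_spread)))
            observed.append(sum(rng.random() < recall for _ in range(count)))
        total += spearman(true_counts, observed)
    return total / n_runs


if __name__ == "__main__":
    for mean_recall, spread in [(1.0, 0.0), (0.9, 0.0), (0.9, 0.2), (0.5, 0.3)]:
        rho = simulate(mean_recall, spread, n_authors=300, n_runs=20)
        print(f"mean recall={mean_recall}, spread={spread}: rho={rho:.3f}")
```

Varying `mean_recall` and `recall_spread` separately gives a rough feel for the distinction the abstract draws between overall data set quality and individual quality differences between authors. A fuller model would also add papers that are wrongly merged into an author's record.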

Acknowledgments

An earlier version of this article was presented at the 14th International Society of Scientometrics and Informetrics Conference, Vienna, Austria, 15th–19th July 2013.

Cite this article

Schulz, J. Using Monte Carlo simulations to assess the impact of author name disambiguation quality on different bibliometric analyses. Scientometrics 107, 1283–1298 (2016). https://doi.org/10.1007/s11192-016-1892-7
