Using Monte Carlo simulations to assess the impact of author name disambiguation quality on different bibliometric analyses

Schulz, Jan

doi:10.1007/s11192-016-1892-7

Using Monte Carlo simulations to assess the impact of author name disambiguation quality on different bibliometric analyses

Published: 23 February 2016

Volume 107, pages 1283–1298, (2016)
Cite this article

Scientometrics Aims and scope Submit manuscript

Jan Schulz ORCID: orcid.org/0000-0002-2312-0588¹

758 Accesses
24 Citations
2 Altmetric
Explore all metrics

Abstract

Bibliometric analyses depend on the quality of data sets and the author name disambiguation process (ANDP), which attributes author names on papers to real persons. Errors in a data set or the ANDP result in wrongly attributed papers to the wrong person. These errors can potentially distort the results of analyses based on such data sets. However, the general impact of data set quality on bibliometric analysis is mostly unknown; as such, an assessment is costly due to the manual steps involved. This paper presents an overview of the data set qualities produced by different ANDPs and uses simulations to study the general impact of data set quality on different bibliometric analysis (author rankings and regressions analysis with number of papers as dependent variable). The results show that rankings of authors are only valid on high quality data sets, which are typically not found directly in commercially available datasets. Both mean and individual per person data set quality is important for valid ranking results. Regressions are not as influenced by the overall data set quality but instead by individual quality differences between authors. Different types of errors can potentially bias the regression results. The outcome of this study also shows the importance of reporting both overall and individual variation in data set quality, so that the validity of analyses based on these data sets can be assessed.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Comparison of bibliometric measures for assessing relative importance of researchers

Article 02 June 2015

Deep and narrow impact: introducing location filtered citation counting

Article 15 November 2019

Hierarchical Modeling Approaches for Generating Author Blueprints

References

Abramo, G., & D’Angelo, C. A. (2011). Evaluating research: From informed peer review to bibliometrics. Scientometrics, 87(3), 499–514. doi:10.1007/s11192-011-0352-7.
Article Google Scholar
Ahmed, Z., & Rahman, A. (2009). Lotka’s Law and Authorship Distribution in Nutrition Research in Bangladesh. Annals of Library and Information Studies, 56(2), 95–102.
Google Scholar
Aksnes, D. W. (2008). When different persons have an identical author name. How frequent are homonyms? Journal of the American Society for Information Science and Technology, 59(5), 838–841. doi:10.1002/asi.20788.
Article Google Scholar
Amancio, D. R., Oliveira, O. N, Jr, & Costa, L. F. (2015). Topological-collaborative approach for disambiguating authors’ names in collaborative networks. Scientometrics, 102(1), 465–485. doi:10.1007/s11192-014-1381-9.
Article Google Scholar
Bedeian, A. G., van Fleet, D. D., & Hyman, H. H. (2009). Scientific achievement and editorial board membership. Organizational Research Methods, 12(2), 211–238. doi:10.1177/1094428107309312.
Article Google Scholar
Center for World-Class Universities of Shanghai Jiao Tong University. (2012). Academic Ranking of World Universities—2012: Ranking Methodology. Retrieved from http://www.shanghairanking.com/ARWU-Methodology-2012.html.
Centra, J. A. (1977). How universities evaluate faculty performance: A survey of department heads. Princeton: Educational Testing Service.
Google Scholar
Chung, K. H., & Cox, R. A. K. (1990). Patterns of productivity in the finance literature: A study of the bibliometric distributions. The Journal of Finance, 45(1), 301–309.
Article Google Scholar
Clauset, A., Shalizi, C. R., & Newman, M. E. J. (2009). Power-law distributions in empirical data. SIAM Review, 51(4), 661. doi:10.1137/070710111.
Article MathSciNet MATH Google Scholar
Cortez, P., & Embrechts, M. J. (2013). Using sensitivity analysis and visualization techniques to open black box data mining models. Information Sciences, 225, 1–17. doi:10.1016/j.ins.2012.10.039.
Article Google Scholar
Cota, R. G., Ferreira, A. A., Nascimento, C., Gonçalves, M. A., & Laender, A. H. F. (2010). An unsupervised heuristic-based hierarchical method for name disambiguation in bibliographic citations. Journal of the American Society for Information Science and Technology, 61(9), 1853–1870. doi:10.1002/asi.21363.
Article Google Scholar
D’Angelo, C. A., Giuffrida, C., & Abramo, G. (2011). A heuristic approach to author name disambiguation in bibliometrics databases for large-scale research assessments. Journal of the American Society for Information Science and Technology, 62(2), 257–269. doi:10.1002/asi.21460.
Article Google Scholar
de Rond, M., & Miller, A. N. (2005). Publish or Perish: Bane or Boon of Academic Life? Journal of Management Inquiry, 14(4), 321–329. doi:10.1177/1056492605276850.
Article Google Scholar
Erman, N., & Todorovski, L. (2015). The effects of measurement error in case of scientific network analysis. Scientometrics, 1–21. doi:10.1007/s11192-015-1615-5
Fan, X., Wang, J., Pu, X., Zhou, L., & Lv, B. (2011). On graph-based name disambiguation. Journal of Data and Information Quality, 2(2), 10:1–10:23. doi:10.1145/1891879.1891883
Fenner, M. (2011). Author identifier overview. LIBREAS Library Ideas, 7(1), 24–29.
Google Scholar
Ferreira, A. A., Gonçalves, M. A., & Laender, A. H. (2012). A brief survey of automatic methods for author name disambiguation. ACM SIGMOD Record, 41(2), 15. doi:10.1145/2350036.2350040.
Article Google Scholar
Ferreira, A. A., Veloso, A., Gonçalves, M. A., & Laender, A. H. F. (2014). Self-training author name disambiguation for information scarce scenarios. Journal of the Association for Information Science and Technology, 65(6), 1257–1278. doi:10.1002/asi.22992.
Article Google Scholar
Frey, B. S. (2003). Publishing as prostitution? Choosing between one’s own ideas and academic success. Public Choice, 116(1/2), 205–223. doi:10.1023/A:1024208701874.
Article Google Scholar
Freyer, L. (2014). Robust rankings. Scientometrics, 100(2), 391–406. doi:10.1007/s11192-014-1313-8.
Article Google Scholar
Haak, L. L., Fenner, M., Paglione, L., Pentz, E., & Ratner, H. (2012). ORCID: A system to uniquely identify researchers. Learned Publishing, 25(4), 259–264. doi:10.1087/20120404.
Article Google Scholar
Han, H., Zha, H., & Giles, C. L. (2005). Name disambiguation in author citations using a K-way spectral clustering method. In Proceedings of the 5th ACM/IEEE-CS joint conference on digital libraries—JCDL ‘05 p. 334. New York: ACM Press.
Harrison, R. L. (2010). Introduction to Monte Carlo Simulation. AIP Conference Proceedings, 1204, 17–21. doi:10.1063/1.3295638.
Article Google Scholar
Harzing, A.-W., & Mijnhardt, W. (2015). Erratum to: Proof over promise: Towards a more inclusive ranking of Dutch academics in Economics & Business. Scientometrics, 102(1), 751–752. doi:10.1007/s11192-014-1511-4.
Article Google Scholar
Henzinger, M., Suñol, J., & Weber, I. (2010). The stability of the h-index. Scientometrics, 84(2), 465–479. doi:10.1007/s11192-009-0098-7.
Article Google Scholar
Hicks, D. (2012). Performance-based university research funding systems. Research Policy, 41(2), 251–261. doi:10.1016/j.respol.2011.09.007.
Article MathSciNet Google Scholar
Hicks, D., Wouters, P., Waltman, L., de Rijcke, S., & Rafols, I. (2015). Bibliometrics: The Leiden Manifesto for research metrics. Nature, 520(7548), 429–431. doi:10.1038/520429a.
Article Google Scholar
Hönekopp, J., & Khan, J. (2012). Future publication success in science is better predicted by traditional measures than by the h index. Scientometrics, 90(3), 843–853. doi:10.1007/s11192-011-0551-2.
Article Google Scholar
Huang, J., Ertekin, S., & Giles, C. L. (2006). Efficient name disambiguation for large-scale databases. In J. Fürnkranz, T. Scheffer, & M. Spiliopoulou (Eds.), Lecture notes in computer science. Knowledge discovery in databases: PKDD 2006 (pp. 536–544). Berlin: Springer. doi:10.1007/11871637_53
Johnson, S. B., Bales, M. E., Dine, D., Bakken, S., Albert, P. J., & Weng, C. (2014). Automatic generation of investigator bibliographies for institutional research networking systems. Journal of Biomedical Informatics. doi:10.1016/j.jbi.2014.03.013
Kang, I.-S., Na, S.-H., Lee, S., Jung, H., Kim, P., Sung, W.-K., & Lee, J.-H. (2009). On co-authorship for author disambiguation. Information Processing and Management, 45(1), 84–97. doi:10.1016/j.ipm.2008.06.006.
Article Google Scholar
Kawashima, H., & Tomizawa, H. (2015). Accuracy evaluation of Scopus Author ID based on the largest funding database in Japan. Scientometrics, 103(3), 1061–1071. doi:10.1007/s11192-015-1580-z.
Article Google Scholar
Klosik, D. F., Bornholdt, S., & Hütt, M. -T. (2014). Motif-based success scores in coauthorship networks are highly sensitive to author name disambiguation. Physical Review E, 90(3), 032811. doi:10.1103/PhysRevE.90.032811
Levin, M., Krawczyk, S., Bethard, S., & Jurafsky, D. (2012). Citation-based bootstrapping for large-scale author disambiguation. Journal of the American Society for Information Science and Technology, 63(5), 1030–1047. doi:10.1002/asi.22621.
Article Google Scholar
Liu, W., Islamaj Doğan, R., Kim, S., Comeau, D. C., Kim, W., Yeganova, L., & Wilbur, W. John. (2014). Author name disambiguation for PubMed. Journal of the Association for Information Science and Technology, 65(4), 765–781. doi:10.1002/asi.23063.
Article Google Scholar
Malin, B., Airoldi, E., & Carley, K. M. (2005). A network analysis model for disambiguation of names in lists. Computational & Mathematical Organization Theory, 11(2), 119–139. doi:10.1007/s10588-005-3940-3.
Article MATH Google Scholar
Meho, L. I., & Yang, K. (2007). Impact of data sources on citation counts and rankings of LIS faculty: Web of science versus scopus and google scholar. Journal of the American Society for Information Science and Technology, 58(13), 2105–2125. doi:10.1002/asi.20677.
Article Google Scholar
Moed, H. (2012). The use of large datasets in bibliometric research: Presentation at the Big Data, E-Science and Science Policy conference in Canberra, Australia, 16th-17th May 2012. Retrieved from https://www.youtube.com/watch?v=wCwxux14O04.
Moed, H. F., Aisati, M., & Plume, A. (2013). Studying scientific migration in Scopus. Scientometrics, 94(3), 929–942. doi:10.1007/s11192-012-0783-9.
Article Google Scholar
Newman, M. E. J. (2001). The structure of scientific collaboration networks. Proceedings of the National Academy of Sciences, 98(2), 404–409. doi:10.1073/pnas.021544898.
Article MathSciNet MATH Google Scholar
Onodera, N., Iwasawa, M., Midorikawa, N., Yoshikane, F., Amano, K., Ootani, Y., & Yamazaki, S. (2011). A method for eliminating articles by homonymous authors from the large number of articles retrieved by author search. Journal of the American Society for Information Science and Technology, 62(4), 677–690. doi:10.1002/asi.21491.
Article Google Scholar
Pereira, D. A., Ribeiro-Neto, B., Ziviani, N., Laender, A. H., Gonçalves, M. A., & Ferreira, A. A. (2009). Using web information for author name disambiguation. In Proceedings of the 2009 joint international conference on digital libraries—JCDL ‘09 (p. 49). New York: ACM Press.
Petersen, A. M., Wang, F., & Stanley, H. E. (2010). Methods for measuring the citations and productivity of scientists across time and discipline. Physical Review E, 81(3), 036114. doi:10.1103/PhysRevE.81.036114.
Article MathSciNet Google Scholar
Reijnhoudt, L., Costas, R., Noyons, E., Börner, K., & Scharnhorst, A. (2014). ‘Seed + expand’: A general methodology for detecting publication oeuvres of individual researchers. Scientometrics, 101(2), 1403–1417. doi:10.1007/s11192-014-1256-0.
Article Google Scholar
Shin, D., Kim, T., Choi, J., & Kim, J. (2014). Author name disambiguation using a graph model with node splitting and merging based on bibliographic information. Scientometrics, 100(1), 15–50. doi:10.1007/s11192-014-1289-4.
Article Google Scholar
Song, Y., Huang, J., Councill, I. G., Li, J., & Giles, C. L. (2007). Efficient Topic-based Unsupervised Name Disambiguation. In JCDL’07, Proceedings of the 7th ACM/IEEE-CS joint conference on digital libraries (pp. 342–351). New York, NY: ACM. doi: 10.1145/1255175.1255243
Stringer, M. J., Sales-Pardo, M., & Nunes Amaral, L. A. (2008). Effectiveness of journal ranking schemes as a tool for locating information. PLoS ONE, 3(2), e1683 EP. doi:10.1371/journal.pone.0001683
Sutter, M., & Kochner, M. (2001). Power laws of research output. Evidence for Journals of Economics. Scientometrics, 51(2), 405–414. doi:10.1023/A:1012757802706.
Google Scholar
Times Higher Education World. (2012). University Rankings 2012-2013—Methodology: The essential elements in our world-leading formula. Retrieved from http://www.timeshighereducation.co.uk/world-university-rankings/2012-13/world-ranking/methodology.
Torvik, V. I., & Smalheiser, N. R. (2009). Author name disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data, 3(3), 1–29. doi:10.1145/1552303.1552304.
Article Google Scholar
Torvik, V. I., Weeber, M., Swanson, D. R., & Smalheiser, N. R. (2005). A probabilistic similarity metric for Medline records: A model for author name disambiguation. Journal of the American Society for Information Science and Technology, 56(2), 140–158. doi:10.1002/asi.20105.
Article Google Scholar
van den Besselaar, P., Bornmann, L., & Leydesdorff, L. (2014). Correction. Journal of Informetrics, 8(4), 801. doi:10.1016/j.joi.2014.07.008.
Article Google Scholar
van Raan, A. F. J. (2005). Fatal attraction: Conceptual and methodological problems in the ranking of universities by bibliometric methods. Scientometrics, 62(1), 133–143. doi:10.1007/s11192-005-0008-6.
Article Google Scholar
Waltman, L., Calero-Medina, C., Kosten, J., Noyons, E. C. M., Tijssen, R. J. W., van Eck, N. J., & van Raan, A. F. (2012). The Leiden ranking 2011/2012: Data collection, indicators, and interpretation. Journal of the American Society for Information Science and Technology, 63(12), 2419–2432. doi:10.1002/asi.22708.
Article Google Scholar
Wang, W., Neuman, E. J., & Newman, D. A. (2014). Statistical power of the social network autocorrelation model. Social Networks, 38, 88–99. doi:10.1016/j.socnet.2014.03.004.
Article Google Scholar
Wang, J., Berzins, K., Hicks, D., Melkers, J., Xiao, F., & Pinheiro, D. (2012a). A boosted-trees method for name disambiguation: Scientometrics, 93(2), 391–411. doi:10.1007/s11192-012-0681-1.
Google Scholar
Wang, D. J., Shi, X., McFarland, D. A., & Leskovec, J. (2012b). Measurement error in network data: A re-classification. Social Networks,. doi:10.1016/j.socnet.2012.01.003.
Google Scholar
Wang, F., Yang, Y., Ma, Z., & Li, L. (2013). A three-stage clustering framework based on multiple feature combination for chinese person name disambiguation. Information Science and Cloud Computing Companion,. doi:10.1109/ISCC-C.2013.33.
Article Google Scholar
Wu, J., & Ding, X.-H. (2013). Author name disambiguation in scientific collaboration and mobility cases. Scientometrics, 96(3), 683–697. doi:10.1007/s11192-013-0978-8.
Article MathSciNet Google Scholar
Wu, H., Li, B., Pei, Y., & He, J. (2014). Unsupervised author disambiguation using Dempster–Shafer theory. Scientometrics, 101(3), 1955–1972. doi:10.1007/s11192-014-1283-x.
Article Google Scholar
Xu, F., Li, X. X., Meng, W., Liu, W. B., & Mingers, J. (2013). Ranking academic impact of world national research institutes–by the Chinese Academy of Sciences. Research Evaluation, 22(5), 337–350. doi:10.1093/reseval/rvt007.
Article Google Scholar
Zhu, J., Yang, Y., Xie, Q., Wang, L., & Hassan, S.-U. (2014). Robust hybrid name disambiguation framework for large databases. Scientometrics, 98(3), 2255–2274. doi:10.1007/s11192-013-1151-0.
Article Google Scholar

Download references

Acknowledgments

An earlier version of this article was presented at the 14th International Society of Scientometrics and Informetrics Conference, Vienna, Austria, 15th–19th July 2013.

Author information

Authors and Affiliations

TU Bergakademie Freiberg, Schlossplatz 1, 09599, Freiberg, Germany
Jan Schulz

Authors

Jan Schulz
View author publications
You can also search for this author in PubMed Google Scholar

Rights and permissions

Reprints and permissions

About this article

Cite this article

Schulz, J. Using Monte Carlo simulations to assess the impact of author name disambiguation quality on different bibliometric analyses. Scientometrics 107, 1283–1298 (2016). https://doi.org/10.1007/s11192-016-1892-7

Download citation

Received: 12 October 2015
Published: 23 February 2016
Issue Date: June 2016
DOI: https://doi.org/10.1007/s11192-016-1892-7

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Using Monte Carlo simulations to assess the impact of author name disambiguation quality on different bibliometric analyses

Abstract

Access this article

Similar content being viewed by others

Comparison of bibliometric measures for assessing relative importance of researchers

Deep and narrow impact: introducing location filtered citation counting

Hierarchical Modeling Approaches for Generating Author Blueprints

References

Acknowledgments

Author information

Authors and Affiliations

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Using Monte Carlo simulations to assess the impact of author name disambiguation quality on different bibliometric analyses

Abstract

Access this article

Similar content being viewed by others

Comparison of bibliometric measures for assessing relative importance of researchers

Deep and narrow impact: introducing location filtered citation counting

Hierarchical Modeling Approaches for Generating Author Blueprints

References

Acknowledgments

Author information

Authors and Affiliations

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation