Abstract
Scientific network analysis takes as input large amounts of bibliographic data that are often incomplete. This incompleteness introduces measurement errors into scientific networks, which in turn influence the results of scientific network analyses. Several authors have studied the effects of measurement error on the results of network analysis, but these studies mostly rely on data gathered through survey questionnaires, or on incomplete data modeled as random processes in unweighted undirected networks. This article aims to overcome the limitations of these studies in three directions. First, we introduce measurement errors into network data following three well-known problems frequently present in bibliographic data: multiple authorship, homographs, and synonyms. Second, we apply missing data mechanisms to the identified incomplete data sources in order to link the latter with the probability of their occurrence. Third, we apply the incomplete data sources to different types of scientific networks and study the effects of measurement error in both the weighted directed (i.e., citation) network and the weighted undirected (i.e., co-authorship) network. The results show that the most destructive incomplete data source is the problem of synonyms; it has the largest influence on the accuracy and robustness of the network structural measures. In contrast, the multiple-authorship problem does not influence the results of network analysis at all.
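As a rough illustration of two of the error sources named in the abstract, the sketch below builds a small weighted co-authorship network and then perturbs it. It is not the authors' procedure: the author names, the helper functions `apply_homograph` and `apply_synonym`, and the choice of node degree as the structural measure are all assumptions made for the example. A homograph is treated here as two distinct authors merged into one node, and a synonym as one author split across two name variants.

```python
from collections import Counter
from itertools import combinations


def coauthorship_weights(papers):
    """Count, for every unordered author pair, the papers they share."""
    weights = Counter()
    for authors in papers:
        # set() guards against the same label appearing twice on one paper
        for a, b in combinations(sorted(set(authors)), 2):
            weights[(a, b)] += 1
    return weights


def apply_homograph(papers, author_a, author_b, shared_name):
    """Hypothetical helper: merge two distinct authors into one node,
    as happens when both are indexed under the same name."""
    relabel = {author_a: shared_name, author_b: shared_name}
    return [[relabel.get(a, a) for a in authors] for authors in papers]


def apply_synonym(papers, author, variant, paper_indices):
    """Hypothetical helper: split one author into two nodes by recording
    a name variant on the selected papers."""
    return [
        [variant if (a == author and i in paper_indices) else a for a in authors]
        for i, authors in enumerate(papers)
    ]


def degree(weights):
    """Number of distinct co-authors per node (edge weights ignored)."""
    neighbours = {}
    for a, b in weights:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    return {node: len(adj) for node, adj in neighbours.items()}


# Toy data: each inner list is the author list of one paper (invented names).
papers = [
    ["Novak_A", "Kim_J"],
    ["Novak_A", "Horvat_M"],
    ["Novak_B", "Silva_P"],
    ["Kim_J", "Horvat_M"],
]

true_degree = degree(coauthorship_weights(papers))
merged_degree = degree(
    coauthorship_weights(apply_homograph(papers, "Novak_A", "Novak_B", "Novak"))
)
split_degree = degree(
    coauthorship_weights(apply_synonym(papers, "Novak_A", "Novak_A.", {1}))
)
```

In this toy case the merged node collects the co-authors of both original authors, while the split leaves each name variant with only part of the true author's neighbourhood. Comparing `true_degree` with the perturbed results gives a simple picture of how such errors can distort a structural measure.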
Acknowledgments
We thank the anonymous reviewer for providing helpful and constructive comments on an earlier version of the manuscript. We acknowledge the financial support of the Slovenian Research Agency through a Grant for training of young researchers and the Grant Number P5-0093 (B).
Cite this article
Erman, N., Todorovski, L. The effects of measurement error in case of scientific network analysis. Scientometrics 104, 453–473 (2015). https://doi.org/10.1007/s11192-015-1615-5
DOI: https://doi.org/10.1007/s11192-015-1615-5