Abstract
The successful application of computational models presupposes access to accurate, relevant, and representative datasets. The growth of public data and the increasing practice of data sharing and reuse emphasise the importance of data provenance and increase the need for modellers to understand how data processing decisions might impact model output. One key step in the data processing pipeline is data integration and entity resolution, where entities are matched across disparate datasets. In this paper, we present a new formulation of data integration in complex networks that incorporates integration uncertainty. We define an approach for understanding how different data integration setups can impact the results of network diffusion models under this uncertainty, allowing one to systematically characterise potential model outputs and so build an output distribution that gives a more comprehensive picture than a single integration result.
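The idea of characterising model outputs across integration setups can be illustrated with a small sketch. This is not the implementation used in the paper; it is a toy example under stated assumptions. Entity resolution is approximated by thresholding string similarity (`difflib.SequenceMatcher`) and merging matches transitively, the diffusion model is a simple discrete-time SI process, and all records, edges, function names, and parameter values are hypothetical.

```python
import random
from difflib import SequenceMatcher
from itertools import combinations


def resolve(names, threshold):
    """Assumed resolution rule: merge records whose string similarity
    is at least `threshold`, transitively (union-find)."""
    parent = {n: n for n in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(names, 2):
        if SequenceMatcher(None, a, b).ratio() >= threshold:
            parent[find(a)] = find(b)
    return {n: find(n) for n in names}


def integrate(edges, mapping):
    """Build the integrated network: one node per resolved entity."""
    graph = {rep: set() for rep in mapping.values()}
    for u, v in edges:
        ru, rv = mapping[u], mapping[v]
        if ru != rv:  # drop self-loops created by merging
            graph[ru].add(rv)
            graph[rv].add(ru)
    return graph


def si_final_size(graph, source, beta, steps, rng):
    """Discrete-time SI process: each infected node infects each
    susceptible neighbour with probability `beta` per step."""
    infected = {source}
    for _ in range(steps):
        new = set()
        for u in sorted(infected):
            for v in sorted(graph[u]):
                if v not in infected and rng.random() < beta:
                    new.add(v)
        infected |= new
    return len(infected)


def output_landscape(names, edges, seed_record, thresholds,
                     beta=0.5, steps=3, runs=200, seed=0):
    """For each integration setup (here: a threshold), collect the
    distribution of diffusion outcomes over repeated runs."""
    rng = random.Random(seed)
    landscape = {}
    for t in thresholds:
        mapping = resolve(names, t)
        graph = integrate(edges, mapping)
        source = mapping[seed_record]
        landscape[t] = [si_final_size(graph, source, beta, steps, rng)
                        for _ in range(runs)]
    return landscape


# Hypothetical records from two datasets with near-duplicate names.
NAMES = ["jon smith", "john smith", "ann lee", "anne lee", "bob ray"]
EDGES = [("jon smith", "ann lee"), ("anne lee", "bob ray")]

if __name__ == "__main__":
    for t, sizes in output_landscape(NAMES, EDGES, "john smith",
                                     [1.0, 0.94, 0.9]).items():
        print(t, sum(sizes) / len(sizes))
```

In this toy setting, the strictest threshold leaves the seed record isolated, while looser thresholds merge near-duplicates and open additional diffusion paths. Reporting the outcomes for every threshold, rather than for one chosen setting, is the kind of output distribution the abstract refers to.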
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Nevin, J., Groth, P., Lees, M. (2023). Data Integration Landscapes: The Case for Non-optimal Solutions in Network Diffusion Models. In: Mikyška, J., de Mulatier, C., Paszynski, M., Krzhizhanovskaya, V.V., Dongarra, J.J., Sloot, P.M. (eds) Computational Science – ICCS 2023. ICCS 2023. Lecture Notes in Computer Science, vol 14073. Springer, Cham. https://doi.org/10.1007/978-3-031-35995-8_35
DOI: https://doi.org/10.1007/978-3-031-35995-8_35
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-35994-1
Online ISBN: 978-3-031-35995-8
eBook Packages: Computer Science, Computer Science (R0)