Data Integration Landscapes: The Case for Non-optimal Solutions in Network Diffusion Models

  • Conference paper
Computational Science – ICCS 2023 (ICCS 2023)

Abstract

The successful application of computational models presupposes access to accurate, relevant, and representative datasets. The growth of public data, and the increasing practice of data sharing and reuse, emphasises the importance of data provenance and increases the need for modellers to understand how data processing decisions might impact model output. One key step in the data processing pipeline is that of data integration and entity resolution, where entities are matched across disparate datasets. In this paper, we present a new formulation of data integration in complex networks that incorporates integration uncertainty. We define an approach for understanding how different data integration setups can impact the results of network diffusion models under this uncertainty, allowing one to systematically characterise potential model outputs in order to create an output distribution that provides a more comprehensive picture.
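
As a rough illustration of the idea in the abstract, the sketch below runs a diffusion model under several data integration setups and collects the outputs per setup. It is a toy example under stated assumptions, not the paper's implementation: an integration setup is reduced to a single match-score threshold, the diffusion model is a simple discrete-time SI process, and all function names, node labels, and scores are hypothetical.

```python
import random
from collections import defaultdict


def integrate(edges, candidate_matches, threshold):
    """Merge candidate pairs scoring >= threshold (assumed integration rule).

    Returns the merged adjacency and a map from each original node
    to the representative node it was merged into.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, score in candidate_matches:
        if score >= threshold:
            parent[find(a)] = find(b)

    adj = defaultdict(set)
    nodes = set()
    for u, v in edges:
        nodes.update((u, v))
        ru, rv = find(u), find(v)
        if ru != rv:  # drop self-loops created by merging
            adj[ru].add(rv)
            adj[rv].add(ru)
    for a, b, _ in candidate_matches:
        nodes.update((a, b))
    rep = {n: find(n) for n in nodes}
    return adj, rep


def si_outbreak_size(adj, seed_node, beta, steps, rng):
    """Discrete-time SI process; returns the number of infected nodes."""
    infected = {seed_node}
    for _ in range(steps):
        newly = set()
        for u in infected:
            for v in adj.get(u, ()):
                if v not in infected and rng.random() < beta:
                    newly.add(v)
        infected |= newly
    return len(infected)


def output_distribution(edges, candidate_matches, thresholds,
                        seed_node, beta, steps, runs, seed=0):
    """Map each threshold (integration setup) to a list of model outputs."""
    rng = random.Random(seed)
    results = {}
    for t in thresholds:
        adj, rep = integrate(edges, candidate_matches, t)
        results[t] = [
            si_outbreak_size(adj, rep[seed_node], beta, steps, rng)
            for _ in range(runs)
        ]
    return results


# Two hypothetical datasets (nodes a*, b*) and scored candidate matches.
edges = [("a1", "a2"), ("a2", "a3"), ("b1", "b2"), ("b2", "b3")]
matches = [("a3", "b1", 0.9), ("a1", "b3", 0.6)]
dist = output_distribution(edges, matches, [0.95, 0.8, 0.5],
                           seed_node="a1", beta=1.0, steps=10, runs=3)
```

With `beta=1.0` the process is deterministic, so each threshold yields a single repeated value; lowering `beta` spreads each list into a distribution. Even in this small case the output does not change monotonically with the threshold, which is the kind of effect that looking at only one "best" integration would hide.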

Author information

Corresponding author

Correspondence to James Nevin.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Nevin, J., Groth, P., Lees, M. (2023). Data Integration Landscapes: The Case for Non-optimal Solutions in Network Diffusion Models. In: Mikyška, J., de Mulatier, C., Paszynski, M., Krzhizhanovskaya, V.V., Dongarra, J.J., Sloot, P.M. (eds) Computational Science – ICCS 2023. ICCS 2023. Lecture Notes in Computer Science, vol 14073. Springer, Cham. https://doi.org/10.1007/978-3-031-35995-8_35

  • DOI: https://doi.org/10.1007/978-3-031-35995-8_35

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-35994-1

  • Online ISBN: 978-3-031-35995-8

  • eBook Packages: Computer Science, Computer Science (R0)
