Data Integration, Cleaning, and Deduplication: Research Versus Industrial Projects

  • Conference paper
Information Integration and Web Intelligence (iiWAS 2022)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 13635))

Abstract

In business applications, data integration is typically implemented in a data warehouse architecture. In this architecture, heterogeneous and distributed data sources are accessed and integrated by means of Extract-Transform-Load (ETL) processes. Designing these processes is challenging due to the heterogeneity of data models and formats, data errors and missing values, and multiple data items representing the same real-world objects. As a consequence, ETL processes are very complex, which results in high development and maintenance costs as well as long runtimes.

To ease the development of ETL processes, various research and technological solutions have been developed. They include, among others: (1) ETL design methods, (2) data cleaning pipelines, (3) data deduplication pipelines, and (4) performance optimization techniques. Although these solutions have been included in commercial (and some open-license) ETL design environments and ETL engines, multiple issues remain open and the existing solutions still need to advance.

In this paper (and its accompanying talk), I will provoke a discussion on the problems one can encounter while implementing ETL pipelines in real business (industrial) projects. The presented findings are based on my experience from research and commercial data integration projects in the financial, healthcare, and software development sectors. In particular, I will focus on three issues: (1) performance optimization of ETL processes, (2) cleaning and deduplicating large row-like data sets, and (3) integrating medical data.
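As a rough illustration of what a cleaning and deduplication step in an ETL pipeline may look like, the sketch below normalizes records, groups them by a blocking key, and compares only records within the same block. It is not the method used in the paper: the attribute names (`name`, `surname`, `city`), the blocking key, the similarity measure (Python's standard `difflib.SequenceMatcher`), and the threshold of 0.85 are all assumptions chosen for the example.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def clean(record):
    """Trim whitespace, lowercase, and map missing values to empty strings."""
    return {key: (value or "").strip().lower() for key, value in record.items()}


def blocking_key(record):
    """Hypothetical blocking key: first letter of the surname plus the city."""
    return (record["surname"][:1], record["city"])


def similarity(a, b):
    """Average string similarity over the compared attributes (assumed choice)."""
    fields = ("name", "surname")
    return sum(SequenceMatcher(None, a[f], b[f]).ratio() for f in fields) / len(fields)


def find_duplicates(records, threshold=0.85):
    """Return index pairs of records considered duplicates of each other."""
    cleaned = [clean(r) for r in records]

    # Blocking: only records sharing a key are compared, which avoids
    # the quadratic cost of comparing every record with every other one.
    blocks = defaultdict(list)
    for idx, rec in enumerate(cleaned):
        blocks[blocking_key(rec)].append(idx)

    pairs = []
    for members in blocks.values():
        for i, j in combinations(members, 2):
            if similarity(cleaned[i], cleaned[j]) >= threshold:
                pairs.append((i, j))
    return pairs


# Made-up sample data, for illustration only.
customers = [
    {"name": "Jon", "surname": "Smith", "city": "Poznan"},
    {"name": " John", "surname": "Smith ", "city": "poznan"},
    {"name": "Anna", "surname": "Nowak", "city": "Warsaw"},
    {"name": "John", "surname": "Smith", "city": None},
]
print(find_duplicates(customers))  # [(0, 1)]
```

Note that the last record is not matched even though its name and surname agree with the second one: its missing city places it in a different block. This is a small instance of how missing values complicate deduplication, and why the choice of blocking key matters in practice.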

Author information

Corresponding author

Correspondence to Robert Wrembel.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Wrembel, R. (2022). Data Integration, Cleaning, and Deduplication: Research Versus Industrial Projects. In: Pardede, E., Delir Haghighi, P., Khalil, I., Kotsis, G. (eds) Information Integration and Web Intelligence. iiWAS 2022. Lecture Notes in Computer Science, vol 13635. Springer, Cham. https://doi.org/10.1007/978-3-031-21047-1_1

  • DOI: https://doi.org/10.1007/978-3-031-21047-1_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-21046-4

  • Online ISBN: 978-3-031-21047-1

  • eBook Packages: Computer Science, Computer Science (R0)
