Data Integration, Cleaning, and Deduplication: Research Versus Industrial Projects

Wrembel, Robert

doi:10.1007/978-3-031-21047-1_1

Robert Wrembel ORCID: orcid.org/0000-0001-6037-5718

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13635)

Included in the following conference series:

International Conference on Information Integration and Web

Abstract

In business applications, data integration is typically implemented as a data warehouse architecture. In this architecture, heterogeneous and distributed data sources are accessed and integrated by means of Extract-Transform-Load (ETL) processes. Designing these processes is challenging due to the heterogeneity of data models and formats, data errors and missing values, and multiple data pieces representing the same real-world objects. As a consequence, ETL processes are very complex, which results in high development and maintenance costs as well as long runtimes.

To ease the development of ETL processes, various research and technological solutions have been developed. They include, among others: (1) ETL design methods, (2) data cleaning pipelines, (3) data deduplication pipelines, and (4) performance optimization techniques. Although these solutions have been included in commercial (and some open-license) ETL design environments and ETL engines, multiple open issues remain and the existing solutions still need to advance.
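As a rough illustration of what items (2) and (3) involve, the following is a minimal sketch of a cleaning step followed by a blocking-based deduplication step. It is not the method used in the paper: the field names (`name`, `surname`, `city`), the blocking key, the similarity function, and the 0.85 threshold are all assumptions chosen for the example, and only the Python standard library is used.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def clean(record):
    """Cleaning step: collapse whitespace, lowercase strings,
    and represent empty strings as None (a missing value)."""
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = " ".join(value.split()).lower()
            if value == "":
                value = None
        cleaned[key] = value
    return cleaned


def blocking_key(record):
    """Hypothetical blocking key: first letter of the surname plus the city.
    Only records sharing a key are compared, which avoids all-pairs matching."""
    surname = record.get("surname") or ""
    city = record.get("city") or ""
    return (surname[:1], city)


def similarity(a, b):
    """Assumed similarity measure: character-level ratio on 'name surname'."""
    full_a = f"{a.get('name') or ''} {a.get('surname') or ''}"
    full_b = f"{b.get('name') or ''} {b.get('surname') or ''}"
    return SequenceMatcher(None, full_a, full_b).ratio()


def find_duplicates(records, threshold=0.85):
    """Return index pairs (i, j) of records judged to be likely duplicates."""
    cleaned = [clean(r) for r in records]

    # Group record indexes by blocking key.
    blocks = defaultdict(list)
    for idx, rec in enumerate(cleaned):
        blocks[blocking_key(rec)].append(idx)

    # Compare pairs only within each block.
    pairs = []
    for members in blocks.values():
        for i, j in combinations(members, 2):
            if similarity(cleaned[i], cleaned[j]) >= threshold:
                pairs.append((i, j))
    return pairs


records = [
    {"name": "Anna ", "surname": "Kowalska", "city": "Poznan"},
    {"name": "anna", "surname": "Kowalsky", "city": "poznan"},
    {"name": "Jan", "surname": "Nowak", "city": "Poznan"},
    {"name": "Anna", "surname": "Kowalska", "city": "Warsaw"},
]
duplicate_pairs = find_duplicates(records)
```

In this sketch, the blocking key trades recall for speed: the fourth record is never compared with the first because the cities differ, which is exactly the kind of design decision that becomes difficult on large, dirty data sets.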

In this paper (and its accompanying talk), I aim to provoke a discussion on the problems one can encounter while implementing ETL pipelines in real business (industrial) projects. The presented findings are based on my experience from research and commercial data integration projects in the financial, healthcare, and software development sectors. I focus on three issues: (1) performance optimization of ETL processes, (2) cleaning and deduplicating large row-like data sets, and (3) integrating medical data.

Author information

Authors and Affiliations

Poznan University of Technology, Poznan, Poland
Robert Wrembel

Corresponding author

Correspondence to Robert Wrembel.

Editor information

Editors and Affiliations

La Trobe University, Melbourne, VIC, Australia
Eric Pardede
Monash University, Melbourne, VIC, Australia
Pari Delir Haghighi
Johannes Kepler University Linz, Linz, Austria
Ismail Khalil
Johannes Kepler University Linz, Linz, Austria
Gabriele Kotsis

About this paper

Cite this paper

Wrembel, R. (2022). Data Integration, Cleaning, and Deduplication: Research Versus Industrial Projects. In: Pardede, E., Delir Haghighi, P., Khalil, I., Kotsis, G. (eds) Information Integration and Web Intelligence. iiWAS 2022. Lecture Notes in Computer Science, vol 13635. Springer, Cham. https://doi.org/10.1007/978-3-031-21047-1_1

DOI: https://doi.org/10.1007/978-3-031-21047-1_1
Published: 20 November 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-21046-4
Online ISBN: 978-3-031-21047-1
eBook Packages: Computer Science, Computer Science (R0)
