Abstract
In business applications, data integration is typically implemented as a data warehouse architecture. In this architecture, heterogeneous and distributed data sources are accessed and integrated by means of Extract-Transform-Load (ETL) processes. Designing these processes is challenging due to the heterogeneity of data models and formats, data errors and missing values, and multiple data pieces representing the same real-world objects. As a consequence, ETL processes are very complex, which results in high development and maintenance costs as well as long runtimes.
To ease the development of ETL processes, various research and technological solutions have been developed. They include, among others: (1) ETL design methods, (2) data cleaning pipelines, (3) data deduplication pipelines, and (4) performance optimization techniques. Although these solutions have been included in commercial (and some open-license) ETL design environments and ETL engines, multiple open issues remain and the existing solutions still need to advance.
In this paper (and its accompanying talk), I will provoke a discussion on what problems one can encounter while implementing ETL pipelines in real business (industrial) projects. The presented findings are based on my experience from research and commercial data integration projects in the financial, healthcare, and software development sectors. In particular, I will focus on three issues, namely: (1) performance optimization of ETL processes, (2) cleaning and deduplicating large row-like data sets, and (3) integrating medical data.
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Wrembel, R. (2022). Data Integration, Cleaning, and Deduplication: Research Versus Industrial Projects. In: Pardede, E., Delir Haghighi, P., Khalil, I., Kotsis, G. (eds) Information Integration and Web Intelligence. iiWAS 2022. Lecture Notes in Computer Science, vol 13635. Springer, Cham. https://doi.org/10.1007/978-3-031-21047-1_1
DOI: https://doi.org/10.1007/978-3-031-21047-1_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-21046-4
Online ISBN: 978-3-031-21047-1
eBook Packages: Computer Science, Computer Science (R0)