Data Integration Revitalized: From Data Warehouse Through Data Lake to Data Mesh

Wrembel, Robert

doi:10.1007/978-3-031-39847-6_1

Robert Wrembel ORCID: orcid.org/0000-0001-6037-5718

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14146)

Included in the following conference series:

International Conference on Database and Expert Systems Applications

Abstract

For years, data integration (DI) architectures have evolved from those supporting virtual integration, through physical integration, to those supporting both virtual and physical integration. Regardless of their type, all DI architectures developed so far include an integration layer. This layer is implemented by sophisticated software, which runs the so-called DI processes. The integration layer is responsible for ingesting data from various sources (typically heterogeneous and distributed) and for homogenizing the data into formats suitable for further processing and analysis. Nowadays, large volumes of highly heterogeneous data are produced in all business domains, e.g., by medical systems, smart cities, and smart agriculture, which calls for further advancements in data integration technologies. In this keynote talk paper, I present my personal opinion on still-to-be-developed data integration techniques, i.e., potential research directions, namely: (1) more flexible DI, (2) quality assurance in complex multi-modal systems, and (3) execution optimization of DI processes.
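
The role of the integration layer described above (ingesting data from heterogeneous sources and homogenizing it into one format) can be illustrated with a minimal sketch. It is not taken from the paper: the two sources, their field names (`id`, `temp_f`, `date`, `patient`, `tempC`, `day`), and the target schema are hypothetical and chosen only for illustration.

```python
import csv
import io
import json
from datetime import date

# Hypothetical target schema (an assumption for this sketch):
#   patient_id: str, temperature_c: float, measured_on: ISO date string


def from_csv(text):
    """Ingest a CSV source with Fahrenheit readings and MM/DD/YYYY dates."""
    for row in csv.DictReader(io.StringIO(text)):
        month, day, year = (int(part) for part in row["date"].split("/"))
        yield {
            "patient_id": row["id"].strip(),
            # Convert Fahrenheit to Celsius so that all sources share one unit.
            "temperature_c": round((float(row["temp_f"]) - 32) * 5 / 9, 1),
            "measured_on": date(year, month, day).isoformat(),
        }


def from_json(text):
    """Ingest a JSON source that already uses Celsius and ISO dates."""
    for rec in json.loads(text):
        yield {
            "patient_id": str(rec["patient"]),
            "temperature_c": float(rec["tempC"]),
            "measured_on": rec["day"],
        }


def integrate(sources):
    """Run each source through its own parser and collect unified records."""
    unified = []
    for parse, payload in sources:
        unified.extend(parse(payload))
    return unified


# Two small in-memory payloads standing in for distributed sources.
CSV_SOURCE = "id,temp_f,date\nP1,98.6,08/18/2023\n"
JSON_SOURCE = '[{"patient": 2, "tempC": 37.5, "day": "2023-08-19"}]'

records = integrate([(from_csv, CSV_SOURCE), (from_json, JSON_SOURCE)])
for record in records:
    print(record)
```

Each source gets its own small parser, and all parsers emit the same dictionary shape, so downstream processing only needs to know the target schema. Real DI processes additionally handle cleaning, deduplication, and scheduling, which this sketch leaves out.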

Author information

Authors and Affiliations

Poznan University of Technology, Poznań, Poland
Robert Wrembel
Artificial Intelligence and Cybersecurity Center, Poznań, Poland
Robert Wrembel

Authors

Robert Wrembel

Corresponding author

Correspondence to Robert Wrembel.

Editor information

Editors and Affiliations

University of Vienna, Vienna, Austria
Christine Strauss
University of Tsukuba, Ibaraki, Japan
Toshiyuki Amagasa
Johannes Kepler University Linz, Linz, Austria
Gabriele Kotsis
Vienna University of Technology, Vienna, Austria
A Min Tjoa
Johannes Kepler University Linz, Linz, Austria
Ismail Khalil

About this paper

Cite this paper

Wrembel, R. (2023). Data Integration Revitalized: From Data Warehouse Through Data Lake to Data Mesh. In: Strauss, C., Amagasa, T., Kotsis, G., Tjoa, A.M., Khalil, I. (eds) Database and Expert Systems Applications. DEXA 2023. Lecture Notes in Computer Science, vol 14146. Springer, Cham. https://doi.org/10.1007/978-3-031-39847-6_1

DOI: https://doi.org/10.1007/978-3-031-39847-6_1
Published: 18 August 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-39846-9
Online ISBN: 978-3-031-39847-6
eBook Packages: Computer Science, Computer Science (R0)
