Data Integration Revitalized: From Data Warehouse Through Data Lake to Data Mesh

  • Conference paper
Database and Expert Systems Applications (DEXA 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14146)

Abstract

For years, data integration (DI) architectures have evolved from those supporting virtual integration, through those supporting physical integration, to those supporting both. Regardless of their type, all DI architectures developed so far include an integration layer. This layer is implemented by sophisticated software that runs the so-called DI processes. The integration layer is responsible for ingesting data from various sources (typically heterogeneous and distributed) and for homogenizing the data into formats suitable for future processing and analysis. Nowadays, large volumes of highly heterogeneous data are produced in all business domains, e.g., by medical systems, smart cities, and smart agriculture, which calls for further advancements in data integration technologies. In this keynote talk paper, I present my personal opinion on still-to-be-developed data integration techniques and potential research directions, namely: (1) more flexible DI, (2) quality assurance in complex multi-modal systems, and (3) execution optimization of DI processes.

Author information

Correspondence to Robert Wrembel.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Wrembel, R. (2023). Data Integration Revitalized: From Data Warehouse Through Data Lake to Data Mesh. In: Strauss, C., Amagasa, T., Kotsis, G., Tjoa, A.M., Khalil, I. (eds) Database and Expert Systems Applications. DEXA 2023. Lecture Notes in Computer Science, vol 14146. Springer, Cham. https://doi.org/10.1007/978-3-031-39847-6_1

  • DOI: https://doi.org/10.1007/978-3-031-39847-6_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-39846-9

  • Online ISBN: 978-3-031-39847-6

  • eBook Packages: Computer Science, Computer Science (R0)
